Abstract

Fortran  and  C++  are  the  dominant  programming  languages  used  in  scientific  computation. 
Consequently,  extensions  to  these  languages  are  the  most  popular  for  programming  massively 
parallel  computers.  We  discuss  two  such  approaches  to  parallel  Fortran  and  one  approach  to 
C++.  The  High  Performance  Fortran  Forum  has  designed  HPF  with  the  intent  of  supporting 
data  parallelism  on  Fortran  90  applications.  HPF  works  by  asking  the  user  to  help  the  compiler 
distribute  and  align  the  data  structures  with  the  distributed  memory  modules  in  the  system. 
Fortran-S  takes  a  different  approach  in  which  the  data  distribution  is  managed  by  the  operating 
system  and  the  user  provides  annotations  to  indicate  parallel  control  regions.  In  the  case  of 
C++,  we  look  at  pC++  which  is  based  on  a  concurrent  aggregate  parallel  model. 
1 Introduction


Exploiting the full potential of parallel architectures requires a cooperative effort between the user
and  the  language  system.  There  is  a  clear  trade-off  between  the  amount  of  information  the  user 
has  to  provide  and  the  amount  of  effort  the  compiler  has  to  expend  to  generate  optimal  parallel 
code.  At  one  end  of  the  spectrum  are  low-level  languages  where  the  user  has  full  control  and  has 
to provide all the details while the compiler effort is minimal. At the other end of the spectrum are
sequential languages where the compiler has the full responsibility for extracting the parallelism.
Clearly,  there  are  advantages  and  disadvantages  to  both  approaches. 

Explicit-tasking  Languages 

Current programming environments for parallel machines follow the first approach, providing
low-level constructs such as message-passing primitives as their principal language constructs. In such
programming environments, an algorithm is specified as a set of sequential processes which execute
concurrently, synchronizing and sharing data explicitly via messages. Since such environments
directly reflect the underlying hardware, such an explicit-tasking approach allows the user to
effectively exploit the full potential of the machine.

However,  for  data  parallel  algorithms,  as  one  typically  finds  in  scientific  programming,  such 
environments  have  proven  quite  awkward  to  use.  The  basic  issue  is  that  programmers  tend  to 
think  in  terms  of  synchronous  manipulation  of  distributed  data  structures,  such  as  grids,  matrices, 
and  so  forth,  while  the  languages  available  provide  no  corresponding  language  constructs.  The 
hardware  support  for  most  current  architectures  is  such  that  locality  of  data  is  critical  for  good 
performance.  Thus,  the  programmer  must  decompose  each  data  structure  into  a  collection  of 
pieces,  with  each  piece  “owned”  by  a  single  processor.  All  interactions  between  different  parts  of 
the  data  structure  must  then  be  explicitly  specified  using  the  low-level  data  sharing  constructs  such 
as  message-passing  statements  supported  by  the  language. 

Decomposing all data structures in this way, and specifying communication explicitly, leads to
programs which can be extraordinarily complicated. Experience has shown that message-passing
versions of algorithms can be five to ten times longer than the sequential version. This code
expansion hides the original algorithm among the details of low-level communications. Programs written
in low-level languages also tend to be highly inflexible, since the partitioning of the data
structures across the processors must be incorporated in all parts of the program. Each operation on a
distributed data structure turns into a sequence of “send” and “receive” operations intricately
embedded in the code. This “hard wires” all algorithm choices, inhibiting exploration of alternatives,
as well as making the parallel program difficult to design and debug.

Direct  Compilation  of  Conventional  Languages 

The second approach to programming multiprocessors, direct compilation of conventional languages
for parallel execution, provides a number of important advantages. First, it allows programmers
to continue using familiar languages as they move to newer and more complex machines. Second,
there is a large body of existing programs which can be transported to parallel architectures without
change. Third, the details of the target architecture are invisible to the programmer, so the complex
load-balancing and program design issues, which must be faced with the explicit-tasking languages,
are not present.

This  approach  is,  in  a  real  sense,  a  direct  outgrowth  of  successful  research  in  construction  of 
vectorizing compilers, and is currently being actively explored by several research groups [6, 7,
26, 51, 56, 63, 65]. Since the millions of lines of existing sequential programs cannot be easily
replaced,  nor  are  they  readily  modifiable,  there  is  clear  importance  to  this  approach,  and  it  will 
surely  continue. 

There are, however, a number of difficulties with this approach. The major one is that the
semantics of conventional languages strongly reflects the sequential von Neumann architecture,
making the task of automatic restructuring very difficult. Extracting parallelism from such
programs requires very aggressive data-flow analysis including array subsection and interprocedural
analysis. Moreover, existing languages, especially Fortran, encourage programming styles which
make it extremely difficult for compilers to extract much parallelism. Freely “equivalenced” arrays,
and passing of “pointers” to simulate dynamic allocation, severely limit the compiler’s ability to
extract parallelism.

Also, once the parallelism has been exposed, it has to be mapped onto the target architecture.
The appropriate mapping, including the distribution of data and work across the processors, is
critically dependent on the characteristics of the program and also those of the target machine.
Because the general mapping problem has been shown to be NP-complete, the heuristic algorithms
used tend to generate sub-optimal code. Given all these problems, the end result seems to be
that direct compilation of sequential languages can extract only modest amounts of loop-level
parallelism.

The  Alternative:  Modest  Language  Extensions 

As argued above, the state of the art in advanced compiler design is not yet up to the task of
parallelizing a sequential application for execution on a massively parallel system with a complex
memory hierarchy. Consequently, the programmer must participate in this process. While there is
a wide variety of new parallel programming languages that help solve this problem, we will focus
attention on three approaches based on modest extensions and annotation systems for Fortran and
C++. The goal of each system is to provide high performance and portability across the three
prevailing classes of computer architectures, which are distinguished by the memory model they
present to the programmer.

True Shared Memory. The address space is global to the machine and access to memory is
uniform. Examples of this class include the CRAY C90 and the SGI Power Challenge Series.

Shared Memory with Non-Uniform Memory Access (NUMA). The address space is global
to the machine and access to memory is non-uniform, i.e., access time depends on the address
and the processor doing the data access. Examples of this class include the BBN TC2000,
the CRAY T3D and the Convex MPP.

Distributed Memory Architecture. The address space is local to each processor and access to
remote data is usually done via message passing. Examples here include the Intel Paragon,
nCube, Parsytec, Meiko and Thinking Machines CM-5.

Because the first two categories provide a global address space for referencing data, they are the
closest to the model most familiar to users. Consequently, the most direct way to make all three
share this property is to build an operating system layer for the distributed memory machines that
provides a “shared virtual memory” model on top of the native message passing system. Given
such a system, any Fortran or C program can execute without modification on the machine. The
problem is then reduced to providing a way for the compiler to partition parallel loops and schedule
access to shared objects. Fortran-S, designed at IRISA, is one such system. In the paragraphs that
follow, we shall describe many of the important ideas that go into the construction of a shared
virtual memory operating system and the Fortran-S compiler.

Fortran-S  uses  program  annotations  to  partition  control  and  the  operating  system  automatically 
partitions  the  data.  An  alternative  strategy  is  to  ask  the  user  to  specify  the  way  the  data  should  be 
partitioned  and  have  the  compiler  decide  how  to  partition  the  control.  High  Performance  Fortran 
(HPF)  follows  this  approach.  While  HPF  code  could  be  compiled  for  a  shared  virtual  memory,  most 
systems  will  use  the  compiler  to  generate  explicit  message  passing  on  distributed  memory  machines. 
In  the  third  section  of  this  paper  we  describe  the  HPF  model  and  the  language  annotations  and 
extensions  required  to  implement  it. 

A third approach, which combines the features of both HPF and shared virtual memory, is pC++,
which is based on a language extension, called concurrent aggregates, which allows the programmer
to define a set of distributed objects which may be referenced from any processor in the system.
As with HPF, the user provides information to the system about how the data objects should be
partitioned among the system memory modules. However, communication between objects uses a
mechanism based on the SVM paging model, but instead of migrating pages of data, copies of data
objects are migrated. In the last section of this paper we describe pC++ and its execution model.

In this paper we have not described other promising approaches. Among these are functional
programming languages such as SISAL [43] and ID, and task parallel systems such as CC++ [31],
Linda [5], Fortran-M or that proposed in [13].

2  Fortran-S  and  Shared  Virtual  Memory 

Programming shared-memory or NUMA architectures is usually simpler than programming distributed
memory architectures because the former offer a global view of the memory, whereas distributed memory
architectures require the user to deal with data exchanges between processors by means of message
passing. Shared memory architectures are attractive from the programming point of view but
they do not scale. As the number of processors increases, the cost of the switch used
to connect memory to processors increases very fast, and the switch may not even be buildable at the
required speed. Fully distributed memory architectures, on the other hand, are scalable but do not offer
the programmer a single address space, making programming more complex.

For programming distributed memory architectures, an approach such as HPF offers a global address
space to the programmer and fills the gap between the programming model and the machine by using
sophisticated compiler techniques and help from the programmer, who is in charge of specifying
the data distribution over the processors at a high level. Another alternative consists of providing
the functionality of a shared memory (it becomes virtual in this case), implemented using either
hardware support or operating system support. This approach makes distributed memory
architectures look like NUMA architectures, which makes programming simpler and the compiler
much easier to design.

2.1  Shared  Virtual  Memory 

A Shared Virtual Memory (SVM) provides the user with an abstraction of the underlying memory
architecture of a distributed memory parallel computer (DMPC). This memory abstraction is also
named VSM (Virtual Shared Memory), DSM (Distributed Shared Memory) [37], etc. We will use
SVM to name this memory abstraction. An SVM [39] is somewhat similar to the virtual memory
currently used on classic mainframe computers. However, it differs in that this virtual
memory is shared by several processors. It provides a virtual address space that is shared by a
number of processes running on different processors of a distributed memory parallel computer.
The virtual address space is made up of pages* which are spread among the local processor memories
according to a mapping function. Compared to a global address space available on shared memory
parallel computers (SMPCs), SVM relies on page caching and heavily on spatial locality. A global
address space, like the one available on the BBN, usually allows access to a single word through the
use of a fast interconnection network. In most cases DMPCs are loosely coupled architectures that
have a high latency network. Accessing the data at a page level amortizes this high latency when
spatial locality is exposed. Since the granularity of the data accesses is a page, several problems
arise. For example, how does one keep pages that are stored in several caches coherent? How do
we locate an up-to-date copy of a given page within the architecture? What happens when there
is not enough room in a cache?

The first problem is related to cache coherence. Since processors may have to read from or
write to the same page, several processors may have a copy of a page in their cache. If one processor
modifies its copy, other processors run the risk of reading an old copy. A cache coherence protocol
is needed to ensure that the shared address space is kept coherent [12]. A memory is considered to
be strongly coherent if the value returned by a read from a location of the shared address space is
the value of the latest store to that location [12]. In most cases, implementation of strong coherence
in an SVM for DMPCs is based on an invalidation mechanism. It assumes that there is only one
copy of a page with write access mode at a given time or, if there are multiple copies of a page, each
of them is in read-only access mode. The processor that has written most recently into the page
is called the owner of the page. When a processor needs to write to a page that is not present in
its cache or is present in read-only mode, it sends a message to the owner of the page in order to
move it to the requesting processor. It then invalidates all the copies in the system by sending a
message to the relevant processors. This invalidation strategy seems to be the best approach for
DMPCs. The fault handling mechanism of an MMU is sufficient to implement this approach
efficiently.
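
The invalidation mechanism described above can be sketched as a small state machine. This is only an illustrative model, not the implementation used by any particular SVM: the class name `Page`, its fields, and the message counts (two messages for a request and its reply, one per invalidation) are assumptions made for the example.

```python
class Page:
    """Coherence state of one shared page (hypothetical model)."""

    def __init__(self, owner):
        self.owner = owner        # processor that wrote most recently
        self.copies = {owner}     # processors holding a valid copy
        self.writable = True      # True: one writable copy; False: read-only copies
        self.messages = 0         # protocol messages exchanged (assumed costs)

    def read(self, p):
        if p in self.copies:
            return                # hit: a valid copy is already cached
        self.messages += 2        # request to the owner, page sent back
        self.writable = False     # every copy is now read-only
        self.copies.add(p)

    def write(self, p):
        if p == self.owner and self.writable:
            return                # hit: p already holds the only writable copy
        if p != self.owner:
            self.messages += 2    # request to the owner, page moves to p
        # one invalidation message per remaining read-only copy
        self.messages += len(self.copies - {p, self.owner})
        self.owner = p
        self.copies = {p}
        self.writable = True


page = Page(owner=0)
page.read(1)
page.read(2)          # three read-only copies now exist
page.write(1)         # processor 1 becomes owner, other copies are invalidated
print(page.owner, sorted(page.copies), page.messages)
```

The invariant to notice is that a writable page always has exactly one copy, which is what makes the memory strongly coherent in this model.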

The second problem is called page ownership. When a processor needs to access a page, either in
write or read access mode, which is not located in its cache, it must ask the owner to send it a copy
of the page. This problem is related to the cache coherence protocol described previously. With the
invalidation protocol, there is always one owner for a page and the ownership changes according
to the page requests coming from other processors. Therefore, the problem is how to locate the
current owner of a given page considering that the owner of a page changes. A solution is to update
a database that keeps track of the movement of pages in the system. This database can be either
centralized or distributed [39]. In the centralized approach, a processor (called the manager) is in
charge of updating the database for every page. When a processor needs a page, it sends a request
to the manager which forwards the request to the owner of the page. Consequently, the manager
is aware of all the movement of pages in the system. However, it may be a bottleneck since the
manager processor receives requests from all other processors. The user’s process running on the
manager will be interrupted frequently and this approach will also create potential contention in
the network. Distributing the database over several processors is a means to avoid these drawbacks.

*The granularity afforded by hardware virtual memory.
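
The bottleneck argument can be made concrete with a short counting sketch. It assumes one simple way of distributing the database, namely that page k is managed by processor k mod P; this mapping and the function name `manager_load` are illustrative assumptions, not a description of a specific system.

```python
from collections import Counter

def manager_load(requests, num_procs, distributed):
    """Count how many page requests each manager processor must handle.

    requests    -- iterable of requested page numbers
    distributed -- False: processor 0 manages every page (centralized);
                   True: page k is managed by processor k % num_procs (assumed)
    """
    load = Counter()
    for page in requests:
        manager = page % num_procs if distributed else 0
        load[manager] += 1
    return load

requests = list(range(64)) * 4          # 64 pages, each requested 4 times
central = manager_load(requests, 8, distributed=False)
spread = manager_load(requests, 8, distributed=True)
print(max(central.values()), max(spread.values()))
```

With a centralized manager a single processor handles every request, while under the assumed mapping the same traffic is divided evenly among the eight managers.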

The last problem is called page swapping. The problem arises when a processor is the owner
of all the pages located in its cache and there is no more space in the cache. If it requests a new
page, it has to find space in its cache. It cannot throw away a page from its cache since it owns
all the pages. Therefore, pages have to be moved to an external high speed storage device, such as
a disk. Some implementations, such as KOAN [34], use the local memories of other processors for
swapping pages. In that case, however, the size of the virtual address space is bounded by the sum of
all local memories.

The implementation of SVM mechanisms is done mostly in software; page requests are
processed by the operating system running on each node. This implementation involves a substantial
overhead since user processes have to be stopped by the operating system to resolve page
requests. This task could be done by dedicated VLSI hardware, as is done in the KSR [2]
machine and in SCI bus-based parallel architectures [1] such as the new Convex MPP.

2.2  Why  Shared  Virtual  Memory  May  Not  Work 

Shared  Virtual  Memory  has  many  intrinsic  problems.  In  the  following  paragraphs,  we  discuss  some 
of  them. 

Initial  Page  Distribution 

The initial page distribution may lead to cold start misses; however, this has a marginal effect on
performance. After the beginning of the application, pages migrate to processors according to data
accesses.

Page  Thrashing 

Page thrashing can lead to capacity misses. For instance, consider the following loop:

doall i = 1,n
   do j = 1,n
      A(j,i) = f(..., B(i,j), ...)
   enddo
enddo

Due to the Fortran column-wise data layout, each access to matrix B will create a page fault if n
is large enough. In that case interchanging the loops would not help (the accesses to A would then
stride across pages instead), but loop blocking would.
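
A minimal simulation can illustrate this behaviour. It assumes an LRU page cache of fixed size on one processor and column-major n x n arrays whose columns each fill exactly one page; the helper names and the parameter values are hypothetical choices for the sketch.

```python
from collections import OrderedDict

def count_faults(accesses, cache_pages):
    """Count page faults of an LRU cache holding at most cache_pages pages."""
    cache = OrderedDict()
    faults = 0
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)           # mark as most recently used
        else:
            faults += 1
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)     # evict the least recently used page
    return faults

def page_of(name, row, col, n, page_words):
    """Page holding element (row, col), 0-based, of a column-major n x n array."""
    return (name, (col * n + row) // page_words)

def trace(n, page_words, block=None, interchange=False):
    """Pages touched by A(j,i) = f(B(i,j)), optionally blocked or interchanged."""
    b = block or n
    for oo in range(0, n, b):
        for pp in range(0, n, b):
            for o in range(oo, min(oo + b, n)):
                for p in range(pp, min(pp + b, n)):
                    i, j = (p, o) if interchange else (o, p)
                    yield page_of("A", j, i, n, page_words)
                    yield page_of("B", i, j, n, page_words)

n, page_words, cache_pages = 64, 64, 16       # one column per page (assumed sizes)
original = count_faults(trace(n, page_words), cache_pages)
swapped = count_faults(trace(n, page_words, interchange=True), cache_pages)
blocked = count_faults(trace(n, page_words, block=8), cache_pages)
print(original, swapped, blocked)
```

In this model the original and interchanged loops fault on every access to one of the two arrays, whereas the blocked version keeps its working set of pages within the cache for the duration of each block.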


False  Sharing 

False sharing occurs when more than one processor writes to the same page at the same time. The
strong coherence mechanism ensures that each processor writing into a page sees the last modification
of it. For example, consider the following loop:

doall i = 1,n
   A(i) = f(...)
enddo

Assuming that A is allocated in a shared address space and is stored in only one page, the page
will exhibit a ping-pong phenomenon as the number of processors increases. That is, the
page will move back and forth between processors, each write costing one page fault at worst (each
word written will cost a data transfer of the size of the page). The execution of the loop becomes
sequential because a page manager will serve only one page request at a time. This phenomenon
may severely degrade performance. However, by increasing the size of the vector this phenomenon
may become negligible.
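
The ping-pong effect can be counted with a short sketch. It assumes block scheduling of the iterations and a lockstep interleaving in which every processor writes its next element at each time step; both the interleaving and the function name `ownership_transfers` are assumptions made for illustration.

```python
def ownership_transfers(n, num_procs, page_words):
    """Count page moves when num_procs processors write A(1..n) in lockstep.

    Processor p writes a contiguous chunk of ceil(n / num_procs) elements.
    A transfer is counted whenever a write hits a page owned by another
    processor; the first touch of a page is not counted.
    """
    chunk = -(-n // num_procs)               # ceiling division
    owner = {}                               # page number -> current owner
    transfers = 0
    for step in range(chunk):
        for p in range(num_procs):
            idx = p * chunk + step
            if idx >= n:
                continue
            page = idx // page_words
            if page in owner and owner[page] != p:
                transfers += 1               # the page ping-pongs to p
            owner[page] = p
    return transfers

# The whole vector fits in one page: almost every write moves the page.
print(ownership_transfers(1024, 4, 1024))
# A vector of 64 pages: each processor works on its own pages.
print(ownership_transfers(65536, 4, 1024))
```

Under these assumptions the single-page vector moves the page on nearly every write, while the larger vector gives each processor whole pages of its own.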

Barrier 

When programming using message passing, synchronization between processes comes for free; data
exchanges synchronize processes. When using shared variables, synchronization must be inserted
to enforce data dependences between processes. However, synchronization does not have to be
implemented using shared variables. Most systems support some sort of barrier operation which can
be used as the primary synchronization mechanism. If the barrier is too slow, serious performance
problems may result.

Broadcast 

A drawback of shared virtual memory on DMPCs is its inability to efficiently run parallel algorithms
that contain a producer/consumers scheme. In these cases, a page is modified by one processor and
then accessed by the other processors. Since all page requests are processed sequentially by a
page manager, the accesses to the data are done sequentially. This obviously constitutes a serious
bottleneck when the number of processors grows.

2.3  Why  Shared  Virtual  Memory  May  Work 

Shared virtual memory may work surprisingly well (see section 3.3) for the following reasons.

Vectorized page access to data (block transfer)

Transferring  a  page  makes  efficient  use  of  the  network,  masking  most  network  latencies.  There 
is  clearly  a  tradeoff  when  choosing  the  page  size.  A  large  page  size  makes  efficient  use  of  the 
network,  but  increases  the  amount  of  unnecessary  data  transferred  and  false  sharing  becomes  a 
greater problem. A small page size transfers a greater percentage of useful data and decreases false
sharing, but it makes inefficient use of the network. To deal with a small transfer size on the
KSR,  where  subpages  of  128  bytes  are  the  basic  transfer  unit,  prefetch  and  poststore  facilities  are 
used  to  hide  large  access  latencies. 
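
The page size trade-off can be illustrated with a simple cost model: the time to move a page is a fixed latency plus the page size divided by the bandwidth. The model itself and the numeric values below (100 microseconds latency, 100 MB/s bandwidth) are assumptions chosen for the sketch, not measurements of any machine.

```python
def transfer_time(page_bytes, latency_s, bytes_per_s):
    """Time to move one page: fixed network latency plus streaming time."""
    return latency_s + page_bytes / bytes_per_s

def useful_rate(page_bytes, useful_bytes, latency_s, bytes_per_s):
    """Useful bytes delivered per second when only useful_bytes are needed."""
    needed = min(useful_bytes, page_bytes)
    return needed / transfer_time(page_bytes, latency_s, bytes_per_s)

LATENCY = 100e-6       # assumed message latency in seconds
BANDWIDTH = 100e6      # assumed link bandwidth in bytes per second

for size in (128, 1024, 4096, 16384):
    dense = useful_rate(size, size, LATENCY, BANDWIDTH)    # whole page is useful
    sparse = useful_rate(size, 128, LATENCY, BANDWIDTH)    # only 128 bytes are useful
    print(size, round(dense), round(sparse))
```

When the whole page is useful, larger pages amortize the latency better; when only a small part is useful, larger pages only add transfer time.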

Figure 1: Allocation of data in memory, assuming 2 processors and assuming 4 processors. In the
case of 4 processors more memory can be devoted to page caching.

Page  caching  allows  the  exploitation  of  data  locality 

Page caching is the only way to compensate for the cost of moving a page between processors.
However, it decreases the effective size of the SVM: the page caching memory is the remainder of the
memory not used by the local data and the shared data. At some point, if the data are too big, the
remaining memory may be too small to keep the pages necessary for an algorithm to behave efficiently.
The main advantage of SVM over other mechanisms is that, even when the locality properties of the
program cannot be discovered at compile time, the SVM can still exploit them. In this respect Shared
Virtual Memory addresses the same problem as the Parti inspector/executor scheme [66].


Not all variables have to be shared

Only  variables  that  are  subject  to  parallel  computation  should  be  shared.  Sharing  all  variables 
leads  to  very  inefficient  code. 


Not  all  parallel  computations  depend  on  the  SVM 

For  example,  exploiting  reduction  parallelism  is  usually  not  done  using  shared  virtual  memory. 
Instead  it  can  be  implemented  by  the  compiler  by  using  message  passing. 

The  Compiler  Can  Help  A  Lot 

Compiler techniques can help by optimizing programs so that they make better use of the
virtual shared memory and also by decreasing the number of synchronizations in the program (i.e.,
decreasing the number of barriers).


2.4  Parallel  Loop  Scheduling  and  Shared  Virtual  Memory 

Parallel loop scheduling is a critical issue in a programming environment that relies on shared
virtual memory. Data movement is the responsibility of the system/hardware, whereas loop scheduling
is the responsibility of the compiler/user. Good scheduling has to ensure data locality and load
balancing. Bad loop scheduling may result in many unnecessary page migrations, false sharing or an
unbalanced load. It should be noted that techniques such as guided self-scheduling are not very well
suited to shared virtual memory because, as the execution of a parallel loop proceeds, the size of the
blocks of iterations that are assigned to processors decreases. Consequently, false sharing increases.
However, when associated with a cache coherence protocol that allows concurrent access to the same
page, this technique may become adequate if it can be implemented efficiently on massively parallel
distributed memory architectures.

There  are  two  main  scheduling  techniques  that  are  well  suited  to  shared  virtual  memory  because 
they  can  be  used  to  reduce  data  movements.  The  first  technique  addresses  the  problem  of  false 
sharing  especially  when  strong  coherence  protocols  are  used,  and  the  second  technique  is  concerned 
with data reuse across loops. In addition, they can be used together to provide good locality and
to decrease false sharing.

Page Aligned Scheduling

Page aligned scheduling can be used to reduce false sharing. The principle consists of distributing
iterations such that the chunks of iterations allocated to processors are aligned with page boundaries.
For example, if we use a simple block scheduling strategy on a simple loop:

do i = 1, N
   A[i] = A[i] + ...
enddo

we get:

bf = ceiling(N/P)
doall pid = 1,N,bf
   do i = pid,min(pid+bf-1,N)
      A[i] = A[i] + ...
   enddo
enddo

If N/P is not a multiple of the page size there will be false sharing for each page shared by two
processors. A simple solution in that case is to consider a blocking factor bf that takes into account
the page size, expressed in array elements (assuming A[1] is aligned on a page boundary):

bf = ceiling(ceiling(N/pagesize)/P) * pagesize
doall pid = 1,N,bf
   do i = pid,min(pid+bf-1,N)
      A[i] = A[i] + ...
   enddo
enddo

However, this technique does not always balance the workload. In general, this technique works
well when the amount of data is large. A more complete description of the method is given in [25].
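
The two blocking factors can be compared numerically. The sketch below follows the formulas in the text; the helper names and the example sizes (N = 10000, P = 4, pages of 1024 elements) are assumptions for illustration.

```python
import math

def block_factor(n, num_procs, page_size=None):
    """Iterations per chunk; page aligned when page_size (in elements) is given."""
    if page_size is None:
        return math.ceil(n / num_procs)
    return math.ceil(math.ceil(n / page_size) / num_procs) * page_size

def chunks(n, bf):
    """Inclusive 1-based (start, end) ranges, as in: doall pid = 1,N,bf."""
    return [(s, min(s + bf - 1, n)) for s in range(1, n + 1, bf)]

def shared_pages(n, bf, page_size):
    """Pages written by more than one chunk (A[1] assumed page aligned)."""
    writers = {}
    for cid, (s, e) in enumerate(chunks(n, bf)):
        for page in range((s - 1) // page_size, (e - 1) // page_size + 1):
            writers.setdefault(page, set()).add(cid)
    return sum(1 for w in writers.values() if len(w) > 1)

N, P, PAGE = 10000, 4, 1024
plain = block_factor(N, P)
aligned = block_factor(N, P, PAGE)
print(plain, shared_pages(N, plain, PAGE))
print(aligned, shared_pages(N, aligned, PAGE), chunks(N, aligned)[-1])
```

In this example the aligned blocking factor removes every shared page, but the last chunk receives far fewer iterations than the others, which is the load imbalance mentioned above.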

Affinity Scheduling

Affinity scheduling tries to minimize data movement by allocating iterations to processors
according to data location [42]. An affinity scheduling technique is provided on the KSR1 machine.
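
As a rough illustration of the idea, and not of the KSR1 mechanism itself, the sketch below assigns each iteration to the processor that already holds the page it touches. The function name and the `page_owner` table are hypothetical.

```python
def affinity_schedule(n, page_words, page_owner):
    """Map each processor to the iterations whose data it already holds.

    Iteration i (0-based) is assumed to touch element i of an array stored
    in pages of page_words elements; page_owner[k] is the processor assumed
    to hold page k.
    """
    schedule = {}
    for i in range(n):
        owner = page_owner[i // page_words]
        schedule.setdefault(owner, []).append(i)
    return schedule

# Four pages of 4 elements each, held by processors 0, 1, 1 and 0.
print(affinity_schedule(16, 4, [0, 1, 1, 0]))
```

No page needs to move under this assignment, but the number of iterations per processor depends entirely on where the pages happen to be.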

It  should  be  noted  that  dynamic  scheduling,  to  improve  load  balancing,  can  be  implemented 
with  SVM  but  on  distributed  memory  architectures  the  implementation  of  such  a  technique  is 
usually  very  costly  at  runtime. 

2.5 Compiler Optimizations for SVM

Compiler optimization for Shared Virtual Memory consists of increasing data locality, and thus
minimizing data transfers. This involves removing shared variables as much as possible,
changing the data layout, and applying loop transformations to improve locality without degrading
the load balance.

Removing  Shared  Variables 

In some cases, it is possible to localize shared variables. The idea behind this optimization relies on
the compiler’s capability of detecting accesses to data structures that are disjoint between processors.
The compiler techniques used in this case are very close to the ones used for compiling Fortran D and
HPF programs [62, 30].

Array Padding and Data Layout

The array padding operation consists of extending array dimensions such that the dimensions of the
array are aligned with page boundaries. This reduces false sharing because different vectors of the
array do not share any pages. The main disadvantages of this technique are that it wastes memory
(and so decreases the size of the memory that can be allocated to the cache) and that it may
increase the amount of communication (unused data are loaded when accessing useful data). More
generally, data layout optimization tries to store data so that false sharing is minimized [61, 18].
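
A short sketch of padding the leading dimension of a column-major array follows. The function names and the sizes used (columns of 1000 elements, pages of 1024 elements) are assumptions for illustration.

```python
def padded_leading_dimension(rows, page_words):
    """Smallest multiple of page_words that is at least rows."""
    return -(-rows // page_words) * page_words

def pages_of_column(col, rows, lead_dim, page_words):
    """Pages holding column col (0-based) of a column-major array."""
    first = (col * lead_dim) // page_words
    last = (col * lead_dim + rows - 1) // page_words
    return set(range(first, last + 1))

rows, page_words = 1000, 1024
lead = padded_leading_dimension(rows, page_words)

# Without padding, neighbouring columns share a page.
print(pages_of_column(0, rows, rows, page_words) & pages_of_column(1, rows, rows, page_words))
# With padding, each column starts on its own page, at the cost of unused words.
print(pages_of_column(0, rows, lead, page_words) & pages_of_column(1, rows, lead, page_words))
print(lead - rows)
```

The last value printed is the number of words wasted per column, which is the memory cost of the technique.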

Optimizing  Data  Locality 

Optimizing data locality relies on changing the order of accesses to a data structure so that the
spatial locality of a loop increases or temporal locality is better exploited. Loop transformations
such as loop interchange, blocking, and unimodular transformations may be used. When temporal
locality exists, it may be possible to exploit data locality using localization of a portion of an
array section that is subject to reuse. These techniques are common to global address space
optimization and cache locality optimization. For example, consider the following loop:

doall i = 1,n
   do j = 1,n
      do k = 1,n
         A(k,i) = f(..., A(k,i), ...)
      enddo
   enddo
enddo
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It  can  be  transformed  into 


doall  i*l,n 
do  t*l,n 

temp(t)  »  A(t,i) 
anddo 

do  j  =  l,n 
do  k  «l,n 

tamp(k)  *  f (. . . ,temp(k) , . . .) 
anddo 
anddo 
do  t*l,n 

A(t,i)  *  temp(t) 
enddo 
enddo 

where temp is allocated locally on each processor. This optimization may reduce the number of page
faults and the amount of false sharing. It should be noted that the array temp is the reference
window as defined in [21, 9]. The cost of the copy is amortized by exploiting the temporal
locality. However, if there was no page thrashing and no false sharing on array A in the original
loop, there is no gain in using this transformation. When applying this kind of optimization, the
size of the temporaries must be limited. These techniques [63, 4, 41, 3, 23, 64, 55, 16, 44] are
well known but are usually targeted at hardware caches or local memory. Most of them should be
revisited to take into account the characteristics of shared virtual memory, in particular the
false sharing phenomenon.
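The localization transformation can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the update function f and the names naive_update and localized_update are hypothetical, the array is stored as a list of columns so that A[i] plays the role of A(:,i), and nothing here models page faults. It only illustrates that copying a column into a local temporary, updating it repeatedly, and copying it back yields the same values as updating the shared array directly.

```python
def f(x):
    # Hypothetical update function standing in for f(..., A(k,i), ...).
    return 0.5 * x + 1.0


def naive_update(A, n):
    """Original loop nest: every update touches the shared array A."""
    for i in range(n):          # doall i
        for j in range(n):      # do j
            for k in range(n):  # do k
                A[i][k] = f(A[i][k])
    return A


def localized_update(A, n):
    """Transformed loop nest: reuse happens on a processor-local copy."""
    for i in range(n):                       # doall i
        temp = [A[i][t] for t in range(n)]   # copy column i in
        for j in range(n):
            for k in range(n):
                temp[k] = f(temp[k])         # only the local copy is touched
        for t in range(n):                   # copy column i back
            A[i][t] = temp[t]
    return A
```

In the localized version the shared array is read once and written once per column, whatever the number of reuses in the j loop, which is where the amortization of the copy cost comes from.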

Barrier  Removal 

When programming with a shared memory model (especially when the execution model is SPMD),
synchronization between processes relies on barriers. One optimization the compiler can perform is
to decrease the number of synchronizations in the program. More generally, part of the
optimization process consists in removing as many calls to the runtime system as possible.

2.6  Runtime  Optimization  for  SVM 

In some cases, support for optimizations may come from system capabilities:

Weak  Coherence 

Several weak cache coherence protocols have been studied in the past. Each of them has some
properties that can be exploited in a specific context [50, 57]. A modified version of the strong
coherence protocol can be considered as a weak cache coherence protocol. If data accesses are
made to different memory locations, it allows processors to modify their own copy of a page
without invalidating the copies held by other processors. When the strong coherence protocol is
restored, all the copies of a page that have been modified are merged into a single page that
reflects all the changes. From the programmer's point of view, the memory is always strongly
coherent at the word level but only weakly coherent at the page level. However, such a weak
coherence scheme does not come for free: its cost depends (usually linearly) on the maximum number
of page copies that have to be merged at the end of the weak coherence phase. A weak coherence
protocol can be used for parallel loops because there is no data dependence between the
iterations of such loops, and so no two iterations write to the same word of a page.
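One possible way to picture the merge step is sketched below in Python. It is not KOAN's implementation: the function name merge_pages, the representation of a page as a list of words, and the choice to detect modifications by comparing each copy with the contents the page had when the weak phase started are all assumptions made for illustration.

```python
def merge_pages(original, copies):
    """Merge locally modified copies of one page into a single page.

    original: the words of the page when the weak coherence phase started.
    copies:   one list of words per processor, same length as original.
    A word is taken from a copy only where that copy differs from the
    original, so untouched words never overwrite another processor's write.
    """
    merged = list(original)
    for copy in copies:
        for idx, (old, new) in enumerate(zip(original, copy)):
            if new != old:
                if merged[idx] != old and merged[idx] != new:
                    # Two processors wrote different values to the same word,
                    # which a dependence-free parallel loop never does.
                    raise ValueError(f"conflicting writes to word {idx}")
                merged[idx] = new
    return merged
```

The outer loop runs once per copy, which matches the remark that the cost of leaving the weak coherence phase grows roughly linearly with the number of copies to merge.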

Page  Broadcast 

A producer/consumers scheme can be managed efficiently by using the broadcasting facility of the
underlying topology of DMPCs (hypercube, 2D-mesh, etc.). All pages that have been modified by
the processor in charge of running the producer phase are broadcast to all the other processors,
which will run the consumer phase in parallel. Since the producer has to keep track of all the
pages that have been modified, two new operating system calls have to be added to the user's code
in order to specify the beginning and the end of the producer phase.

Page Locking

Page locking allows a processor to lock a page into its cache until it decides to release it. This
basic mechanism can be used to implement an atomic update of a memory location. The user is
responsible for adding two system calls that specify the beginning and the end of the code section
in which each remote data access requires a page to be locked into the cache. Page locking is very
efficient and minimizes the number of critical sections within a parallel code. On loosely coupled
parallel architectures, such as DMPCs, critical sections are expensive. To illustrate this, let us
take a small example: the matrix assembly found in finite element applications. A loop scans an
irregular mesh and values are accumulated into a matrix. The matrix is accessed through an index
scheme, and there are often runtime data dependences. Consequently, the loop can be parallelized
only if the accumulation is executed within a critical section, so that several processors never
write to the same matrix element at the same time. A page locking mechanism can replace many
critical sections. Before a matrix element is updated, the page that contains it is locked into
the cache; the page is released after the update. The cost of such a synchronization is simply
related to the number of processors that access the same page at the same time.
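The assembly pattern can be sketched in Python with threads standing in for processors and one lock per page standing in for the page lock. This is an analogy, not the SVM mechanism itself: the class PagedVector, the functions assemble and parallel_assemble, and the page size of 512 elements are assumptions chosen for the example.

```python
import threading

PAGE_WORDS = 512  # assumed number of matrix elements stored in one page


class PagedVector:
    """Flat array whose elements are protected by one lock per page."""

    def __init__(self, size):
        self.data = [0.0] * size
        n_pages = (size + PAGE_WORDS - 1) // PAGE_WORDS
        self.locks = [threading.Lock() for _ in range(n_pages)]

    def accumulate(self, index, value):
        # Lock the page holding the element, update it, then release it.
        with self.locks[index // PAGE_WORDS]:
            self.data[index] += value


def assemble(vec, contributions):
    """Accumulate (index, value) pairs, as in a finite element assembly."""
    for index, value in contributions:
        vec.accumulate(index, value)


def parallel_assemble(size, per_thread_contributions):
    """Run one assembly loop per thread on a shared PagedVector."""
    vec = PagedVector(size)
    threads = [threading.Thread(target=assemble, args=(vec, c))
               for c in per_thread_contributions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return vec.data
```

Two updates contend only when they fall in the same page, which mirrors the statement that the synchronization cost depends on how many processors access the same page at the same time.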

2.7  Mixing  Messages  and  Shared  Virtual  Memory 

Mixing message passing and shared variables can be used to improve performance in library
code. When dealing with shared variables and messages, programming is somewhat simplified
since the programmer does not have to worry about the data distribution. The programmer only
has to think in terms of parallel processes. One of the main advantages of this approach is that
an efficient algorithm may be implemented independently of the program it is called from. For
example, consider the in-place matrix transpose. This algorithm behaves very badly with SVM
when data transfer is at the level of pages. But by using message passing to do the transpose, it
is possible to get speedup on this operation. The algorithm can be written so that it is
independent of the data distribution of the matrix. In a pure message passing programming
environment, it is not possible to provide such a primitive without forcing the programmer to use
a predefined data distribution of the matrix on the processors, and this data layout may be
completely inadequate in the remainder of the application.

3 Fortran-S: a Prototype Environment for SVM

Fortran-S is a Fortran programming environment that relies on the KOAN shared virtual memory.
The programming model is based on shared variables and parallel loops. Parallel loops and shared
variables are declared to the compiler via directives. The project's main goal is to study
compilers and programming environments for shared virtual memory. Fortran-S differs from
KSR-Fortran mainly in the execution model: KSR-Fortran relies on fork-and-join execution (i.e. the
main thread spawns multiple threads when parallel phases of execution occur), whereas Fortran-S
relies on an SPMD execution model (i.e. a thread is created on every processor during the loading
phase). To illustrate Fortran-S, we provide the following small example:

      real v(n,n)
C$ann[Shared(v)]
      do i = 1, n
        tmp = 0.0
        do k = 1, n
          tmp = tmp + v(k,i)*v(k,i)
        enddo
        xnorm = 1.0 / sqrt(tmp)
        do k = 1, n
          v(k,i) = v(k,i) * xnorm
        enddo
C$ann[DoShared("BLOCK")]
        do j = i+1, n
          tmp = 0.0
          do k = 1, n
            tmp = tmp + v(k,i)*v(k,j)
          enddo
          do k = 1, n
            v(k,j) = v(k,j) - tmp*v(k,i)
          enddo
        enddo
      enddo

This is a parallel version of the Modified Gram-Schmidt algorithm. It is made up of two
nested loops. The outer loop normalizes each vector stored in the matrix v. When a vector has been
normalized, the remaining vectors in the matrix are corrected by executing the inner loop.
These corrections can be done in parallel. With the addition of two Fortran-S directives, the code
generator is able to generate SPMD code that will be executed on every processor. The first
directive (C$ann[Shared(v)]) specifies that matrix v has to be stored in the shared virtual
memory, since it will be updated within a parallel loop. Other variables are replicated in the
local memory of each processor. Every processor executes the outer loop as well as all the
assignments that modify a local variable. However, for each outer loop iteration, only one
processor updates the shared variable v(k,i). (In the previous example, every processor will write
into the replicated variables tmp and xnorm.) The second directive (C$ann[DoShared("BLOCK")])
indicates that the following loop is a parallel loop. Each processor is in charge of executing a
chunk of the iteration space. A detailed description of Fortran-S can be found in [11].
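The way a block-scheduled parallel loop divides the inner iterations can be sketched in Python. The exact chunking performed by DoShared("BLOCK") is not specified here, so the ceiling-division split in block_range is an assumption, as are the names block_range and mgs_spmd. Processors are simulated one after another, which is legitimate only because the iterations of the j loop are independent.

```python
import math


def block_range(lo, hi, rank, nprocs):
    """Iterations of lo..hi-1 owned by processor `rank` under a block split."""
    count = max(hi - lo, 0)
    chunk = (count + nprocs - 1) // nprocs  # ceiling division
    start = min(lo + rank * chunk, hi)
    return range(start, min(start + chunk, hi))


def mgs_spmd(v, nprocs):
    """Modified Gram-Schmidt on the columns of v (v[i] is column i).

    The loop over `rank` replays, one after another, the chunks that the
    processors would execute concurrently in the parallel loop.
    """
    n = len(v)
    for i in range(n):
        # Sequential section: normalize column i.
        xnorm = 1.0 / math.sqrt(sum(x * x for x in v[i]))
        v[i] = [x * xnorm for x in v[i]]
        # Parallel loop over j = i+1 .. n-1, split into blocks.
        for rank in range(nprocs):
            for j in block_range(i + 1, n, rank, nprocs):
                tmp = sum(a * b for a, b in zip(v[i], v[j]))
                v[j] = [b - tmp * a for a, b in zip(v[i], v[j])]
    return v
```

Note that the iteration space of the parallel loop shrinks as i grows, so with a block split the last outer iterations leave some processors without work.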

3.1  KOAN  Runtime 

The KOAN SVM is embedded in the operating system of the iPSC/2. This allows the use of fast,
low-level communication primitives as well as of the Memory Management Unit (MMU). The KOAN
SVM implements the fixed distributed manager algorithm described in [39], with an invalidation
protocol for keeping the shared memory coherent at all times. A detailed description of the KOAN
SVM can be found in [34]. Let us now summarize some of the functionalities of the KOAN SVM
runtime.

KOAN SVM provides the user with several memory management protocols for efficiently handling
special memory access patterns. One of these patterns occurs when several processors have to
write into different locations of the same page. This pattern involves many messages since the
page has to move from processor to processor (the ping-pong effect, or false sharing). At the cost
of adding some new subroutine calls to the parallel code, KOAN can let processors concurrently
modify their own copy of a page. Another drawback of shared virtual memory on DMPCs is its
inability to run efficiently parallel algorithms that contain a producer/consumers scheme, in
which a page is modified by one processor and then accessed by the other processors. KOAN SVM can
manage this memory access pattern efficiently by using the broadcasting facility of the underlying
topology of DMPCs (hypercube, 2D-mesh, etc.). All pages that have been modified by the processor
in charge of running the producer phase are broadcast to all the other processors, which will run
the consumer phase in parallel. KOAN SVM provides barrier synchronization as well as subroutines
to manage critical sections. These features are implemented by using messages instead of shared
variables. KOAN is compatible with the NX/2 operating system, i.e. the primitives provided by the
system can be used simultaneously with KOAN.

We have performed measurements in order to determine the costs of various basic operations
for both read and write page faults (the size of a page is 4 Kbytes) of the KOAN shared virtual
memory. For each type of page fault (read or write), we have tested the best and worst possible
situations on different numbers of processors. For a 32-processor configuration, the time required
to resolve a read page fault is in the range of 3.412 ms to 3.955 ms. For a write page fault,
timing results are in the range of 3.447 ms to 10.110 ms, depending on the number of copies that
have to be invalidated. These results can be compared with the communication times of the iPSC/2:
the latency is roughly 0.3 ms, and sending a 4 Kbyte message (a page) costs between 2.17 ms and
2.27 ms depending on the number of routing steps.

3.2  Fortran-S  Code  Generator 

Fortran-S* relies on parallel loops to achieve parallelism. Parallel execution is achieved using
the SPMD (Single Program Multiple Data) execution model. At the beginning of program
execution, a thread is created on each processor and each processor starts to execute the program.
One of the main functions of the Fortran-S compiler is to make the SPMD execution look like a
single-threaded execution, by inserting synchronization where needed and by updating shared
variables correctly. The programming model uses directives to specify shared variables and
parallel loops. A shared variable can be read or written by all the processors. A non-shared
variable is duplicated on all the processors. Since every processor executes the sequential code
sections, non-shared variables always have the same value on every processor. The iteration space
of a parallel loop is distributed over the processors: each processor executes only a subset of
the iteration space. Fortran-S provides several directives to generate efficient parallel code
[11].

* The prototype compiler has been implemented using the Sigma System [22].

Table 1: Performance results for the Jacobi loops.

3.3  Performance 

In this section we present the first results obtained using Fortran-S on an Intel iPSC/2 with
32 nodes. The goal of these experiments was to port sequential Fortran 77 programs to Fortran-S
and to measure the performance obtained. We did not intend, in these early performance
measurements, to modify the applications extensively. Rather, we intended to measure the
performance of Fortran-S in a straightforward translation from Fortran 77. Very few modifications
were made to the original programs. The primary modification was to expose parallel loops in the
programs. No modification was made to the data structures used in the programs. Also, there were
no major modifications to the algorithms, so the scalability of some applications is limited not
by Fortran-S but by the algorithm used in the application. The problem of false sharing that
appears in many applications was solved using a weak coherence protocol.

The first code used is taken from a Jacobi iteration. Table 1 gives the speedups and efficiencies
for different problem sizes when using either a strong or a weak cache coherence protocol. For a
matrix size of 100 x 100, we got a "speed-down" when the number of processors is greater than
16. False sharing can be avoided by using a weak coherence protocol. For the same problem size,
this cache coherence protocol improves the speedups a little, but the speedup remains flat. For a
larger problem size (200 x 200) we did not observe such phenomena. However, when the number of
processors is set to 32, the efficiency is poor (20.71%). The weak cache coherence protocol increases
the efficiency to 32.49%. This behavior is observed only for small matrices. For large matrices the
efficiency is close to the maximum.

                      100 x 100                          200 x 200
# proc.   Times (ms)   Speedup   Eff. (%)    Times (ms)    Speedup   Eff. (%)

With strong coherence
    1        15694        -         -          127657         -         -
    2         7920       1.98     99.08     [illegible]      1.99     99.67
    4         4056       3.87     96.73         32292        3.95     98.83
    8         2206       7.11     88.93         16522        7.73     96.58
   16         3393       4.63     28.91          8982       14.21     88.83
   32         4379       3.58     11.20          5196       24.57     76.78

With weak coherence
    1        15694        -         -          127657         -         -
    2         7923       1.98     99.04         64036        1.99     99.68
    4         4048       3.88     96.92         32276        3.96     98.88
    8         2202       7.13     89.09         16521        7.73     96.59
   16         1287      12.19     76.21          8972       14.23     88.93
   32          884      17.75     55.48          5206       24.52     76.63

Table 2: Performance results for the matrix multiply.

The second parallel algorithm we studied is the matrix multiply. Table 2 gives timing results
for small matrices (100 x 100 and 200 x 200). For larger matrix sizes, speedups are close to the
maximum. This can be seen in the table: for a 32-node configuration, the speedup increases from
3.58 to 24.57 when the number of matrix elements quadruples. However, for small matrices, the
results can be improved by using the weak cache coherence protocol. Indeed, the poor performance
is always due to the same effect: false sharing. The same table provides timing results when the
parallel loop is executed with weak coherence. For the small matrix, the gain in performance is
impressive. When the number of processors is set to 32, the speedup rises from 3.58 to 17.75.
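The speedup and efficiency columns appear to follow the usual definitions, speedup = T(1) / T(p) and efficiency = speedup / p expressed as a percentage. The short Python sketch below recomputes a few entries of Table 2 from the execution times under that assumption; the function names are ours.

```python
def speedup(t1, tp):
    """Ratio of the one-processor time to the p-processor time."""
    return t1 / tp


def efficiency(t1, tp, nprocs):
    """Speedup divided by the number of processors, as a percentage."""
    return 100.0 * speedup(t1, tp) / nprocs


# 100 x 100 matrix multiply on 32 processors (times in ms, from Table 2).
strong = round(speedup(15694, 4379), 2)   # strong coherence
weak = round(speedup(15694, 884), 2)      # weak coherence
weak_eff = round(efficiency(15694, 884, 32), 2)
```

With these definitions the 32-processor times of 4379 ms and 884 ms give speedups of 3.58 and 17.75, in agreement with the values quoted in the text.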


                 Strong coherence                    Weak coherence                     Weak+Broadcast
# proc.   Times (s)    Speedup      Eff.     Times (s)   Speedup      Eff.      Times (s)   Speedup      Eff.

200 x 200
    1       125.99        -           -        125.99       -           -         125.99       -           -
    2        79.34   [illegible]    79.40       66.34      1.90    [illegible]     66.69      1.89       94.46
    4        49.06   [illegible] [illegible]    37.07      3.40       84.97        37.09      3.40       84.92
    8        61.59       2.05       25.57       23.99      5.25       65.65        23.04      5.47       68.35
   16        65.49       1.92       12.02       20.61      6.11       38.21        16.85      7.48       46.73
   32        78.79       1.60        5.00       23.62      5.33       16.67        14.88      8.47       26.46

500 x 500
    1      1986.81        -           -       1986.81       -           -        1986.81       -           -
    2      1029.11       1.93       96.53     1007.51      1.97       98.60      1013.20      1.96    [illegible]
    4       562.52       3.53       88.30      517.57      3.84       95.97       522.38      3.80    [illegible]
    8       339.23       5.86       73.21      276.17      7.19       89.93       278.72      7.13       89.10
   16       233.10       8.52       53.27      163.98     12.12       75.73       158.97     12.50       78.11
   32       205.75       9.66       30.18      124.71     15.93       49.79       101.62     19.55       61.10

Table 3: Performance results for the MGS algorithm.

The last experiment involved the Modified Gram-Schmidt algorithm described above. This
algorithm consists of two nested loops. We added some directives in order to improve the efficiency
of the parallel MGS algorithm. The vector that is modified in the sequential section is broadcast
to every processor, since it will be accessed within the parallel loop. A weak cache coherence
protocol is also associated with the inner loop to avoid false sharing. A detailed study of this
algorithm can be found in [52, 53]. Table 3 summarizes the results we obtained with the different
strategies.

Several other parallel algorithms and applications have been ported to KOAN. Their performance
results are presented in [54, 10].

4 High Performance Fortran

Recently an international group of researchers from academia, industry and government labs formed
the High Performance Fortran Forum, aimed at providing an intermediate approach in which the
user and the compiler share responsibility for exploiting parallelism. The main goal of the group
has been to design a high-level set of standard extensions to Fortran, called High Performance
Fortran (HPF), intended to exploit a wide variety of parallel architectures [28, 40].

The HPF extensions allow the user to carefully control the distribution of data across the
memories of the target machine. However, the computation code is written using a global name
space with no explicit message passing statements. It is then the compiler's responsibility to
analyze the distribution annotations and generate parallel code, inserting communication
statements where required by the computation. Thus, using this approach, the programmer can focus
on high-level algorithmic and performance-critical issues such as load balance, while allowing the
compiler system to deal with the complex low-level machine-specific details.

Earlier  efforts 

The  HPF  effort  is  based  on  research  done  by  several  groups,  some  of  which  are  described  below. 
The  language  IVTRAN  [47],  for  the  SIMD  machine  ILLIAC  IV,  was  one  of  the  first  languages  to 
allow  users  to  control  the  data  layout.  The  user  could  indicate  the  array  dimensions  to  be  spread 
across  the  processors  and  those  which  were  to  be  local  in  a  processor.  Combinations  resulting  in 
physically  skewed  data  were  also  allowed. 

In the context of MIMD machines, Kali (and its predecessor BLAZE) [45, 46] was the first
language to introduce user-specified distribution directives. The language allows the dimensions of
an array to be mapped onto an explicitly declared processor array using simple regular distributions
such as block, cyclic and block-cyclic, and more complex distributions such as irregular, in which
the address of each element is explicitly specified. Simple forms of user-defined distribution are
also permitted. Kali also introduced the idea of dynamic distributions, which allow the user to
change the distribution of an array at runtime. The parallel computation is specified using forall
loops within a global name space. The language also introduced the concept of an on clause, which
allows the user to control the distribution of loop iterations across the processors.

The Fortran D project [19] follows a slightly different approach to specifying distributions. The
distribution of data is specified by first aligning data arrays to virtual arrays known as
decompositions. The decompositions are then distributed across an implicit set of processors using
relative weights for the different dimensions. The language allows an extensive set of alignments
along with simple regular and irregular distributions. All mapping statements are considered
executable statements, thus blurring the distinction between static and dynamic distributions.

Vienna Fortran [14, 68] is the first language to provide a complete specification of distribution
constructs in the context of Fortran. Based largely on the Kali model, Vienna Fortran allows arrays
to be aligned to other arrays, which are then distributed across an explicit processor array. In
addition to the simple regular and irregular distributions, Vienna Fortran defines a generalized
block distribution, which allows unequal-sized contiguous segments of the data to be mapped to the
processors. Users can define their own distribution and alignment functions, which can then be
used to provide a precise mapping of data to the underlying processors. The language maintains
a clear distinction between distributions that remain static during the execution of a procedure
and those that can change dynamically, allowing compilers to optimize code for the two different
situations. It defines multiple methods of passing distributed data across procedure boundaries,
including inheriting the distribution of the actual arguments. Distribution inquiry functions
facilitate the writing of library functions that are optimal for multiple incoming distributions.

The High Performance Fortran effort has been based on the above and other related projects [8, 27,
38, 48, 58, 59, 60]. In the next few subsections we provide a short introduction to HPF,
concentrating on the features that are critical to parallel performance.

4.1  HPF  Overview 

High Performance Fortran* is a set of extensions to Fortran 90 designed to allow the specification
of data parallel algorithms. The programmer annotates the program with distribution directives to
specify the desired layout of data. The underlying programming model provides a global name
space and a single thread of control. Explicitly parallel constructs allow the expression of fairly
controlled forms of parallelism, in particular data parallelism. Thus, the code is specified in a
high-level, portable manner with no explicit tasking or communication statements. The goal is to
allow architecture-specific compilers to generate efficient code for a wide variety of
architectures, including SIMD machines and MIMD shared and distributed memory machines.

Fortran 90 was used as a base for the HPF extensions for two reasons. First, a large percentage of
scientific codes are still written in Fortran (Fortran 77, that is), providing programmers using
HPF with a familiar base. Second, the array operations defined in Fortran 90 make it eminently
suitable for data parallel algorithms.

Most of the HPF extensions are in the form of directives or structured comments, which assert
facts about the program or suggest implementation strategies such as data layout. Since these
are directives, they do not change the semantics of the program but may have a profound effect
on the efficiency of the generated code. The syntax used for these directives is such that, if the
HPF extensions are at some later date accepted as part of the language, only the prefix !HPF$
needs to be removed to retain a correct HPF program. HPF also introduces some new language syntax
in the form of data parallel execution statements and a few new intrinsics.

* This chapter is partially based on the High Performance Fortran Language Specification draft
document [28], which has been jointly written by several of the participants of the High
Performance Fortran Forum. Also, the specifications (as described here) are still under review and
may change when the final document is released.

Features of High Performance Fortran

In this subsection we provide a brief overview of the new features defined by HPF. In the next few
subsections we will provide a more detailed view of some of these features.

• Data mapping directives: HPF provides an extensive set of directives to specify the
distribution and alignment of arrays.

•  Data  parallel  execution  features:  The  FORALL  statement  and  construct  and  the  INDEPENDENT 
directive  can  be  used  to  specify  data  parallel  code.  The  concept  of  pure  procedures  callable 
from  parallel  constructs  has  also  been  defined. 

• New intrinsic and library functions: HPF provides a set of new intrinsic functions, including
system functions to inquire about the underlying hardware, mapping inquiry functions
to inquire about the distribution of the data structures, and a few computational intrinsic
functions. A set of new library routines has also been defined so as to provide a standard
interface for highly useful parallel operations such as reduction functions, combining scatter
functions, prefix and suffix functions, and sorting functions.

•  Extrinsic  procedures:  HPF  is  well  suited  for  data  parallel  programming.  However,  in  order 
to  accommodate  other  programming  paradigms,  HPF  provides  extrinsic  procedures.  These 
define  an  explicit  interface  and  allow  codes  expressed  using  a  different  paradigm,  such  as  an 
explicit  message  passing  routine,  to  be  called  from  an  HPF  program. 

• Sequence and storage association: The Fortran concepts of sequence and storage association*
assume an underlying linearly addressable memory. Such assumptions create a problem on
architectures that have a fragmented address space and are not compatible with the data
distribution features of HPF. Thus, HPF places restrictions on the use of storage and sequence
association for distributed arrays. For example, arrays that have been distributed cannot
be passed as actual arguments associated with dummy arguments that have a different
rank or shape. Similarly, arrays that have been storage associated with other arrays can be
distributed only in special situations. The reader is referred to the HPF Language Specification
document [28] for full details of these restrictions and other HPF features.

4.2  Data  Mapping  Directives 

A major part of the HPF extensions is aimed at specifying the alignment and distribution of the
data elements. The underlying intuition for such a mapping of data is as follows. If the
computations on different elements of a data structure are independent, then distributing the data
structure will allow the computation to be executed in parallel. Similarly, if elements of two
data structures are used in the same computation, then they should be aligned so that they reside
in the same processor memory. Obviously, the two factors may be in conflict across computations,
giving rise to situations where data needed in a computation resides on some other processor. This
data dependence is then satisfied by communicating the data from one processor to another. Thus,
the main goal of mapping data onto processor memories is to increase parallelism while minimizing
communication, in such a way that the workload across the processors is balanced.

* Informally, sequence association refers to the Fortran assumption that the elements of an array
are in a particular order (column-major) and hence allows redimensioning of arrays across procedure
boundaries. Storage association allows COMMON and EQUIVALENCE statements to constrain and align
data items relative to each other.

Figure 2: HPF data distribution model

HPF uses a two-level mapping of data objects to abstract processors, as shown in Figure 2.
First, data objects are aligned to other objects, and then groups of objects are distributed on a
rectilinear arrangement of abstract processors.

Each array is created with some mapping of its elements to abstract processors, either on entry to
a program unit or, for allocatable arrays, at the time of allocation. This mapping may be specified
by the user through the ALIGN and DISTRIBUTE directives or, where complete specifications are not
provided, may be chosen by the compiler.


Processors  Directive 

The PROCESSORS directive can be used to declare one or more rectilinear arrangements of
processors in the specification part of a program unit. If two processor arrangements have the same
shape, then corresponding elements of the two arrangements are mapped onto the same physical
processor, thus ensuring that objects mapped to these abstract processors will reside on the same
physical processor.

The intrinsics NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE can be used to
determine the actual number of physical processors being used to execute the program. This
information can then be used in declaring the abstract processor arrangement.


!HPF$ PROCESSORS P(N)
!HPF$ PROCESSORS Q(NUMBER_OF_PROCESSORS())
!HPF$ PROCESSORS R(8, NUMBER_OF_PROCESSORS()/8)
!HPF$ PROCESSORS SCALARPROC


Here, P is a processor arrangement of size N, the size of Q (and the shape of R) is dependent
upon the number of physical processors executing the program, while SCALARPROC is conceptually
treated as a scalar processor.

A compiler must accept any processor declaration which is either scalar or whose total number
of elements matches the number of physical processors. The mapping of the abstract processors
to physical processors is compiler-dependent. It is expected that implementors may provide
architecture-specific directives to allow users to control this mapping.


Distribution  Directives 

The  DISTRIBUTE  directive  can  be  used  to  specify  the  distribution  of  the  dimensions  of  an  array 
to  dimensions  of  an  abstract  processor  arrangement.  The  different  types  of  distributions  allowed 
by  HPF  are:  BLOCK(expr),  CYCLIC(expr)  and  *. 

PARAMETER (N = NUMBER_OF_PROCESSORS())
!HPF$ PROCESSORS Q(NUMBER_OF_PROCESSORS())
!HPF$ PROCESSORS R(8, NUMBER_OF_PROCESSORS()/8)
REAL A(100), B(200), C(100,200), D(100,200)
!HPF$ DISTRIBUTE A(BLOCK) ONTO Q
!HPF$ DISTRIBUTE B(CYCLIC(5))
!HPF$ DISTRIBUTE C(BLOCK, CYCLIC) ONTO R
!HPF$ DISTRIBUTE D(BLOCK(10), *) ONTO Q


In  the  above  examples,  A  is  divided  into  N  contiguous  blocks  of  elements  which  are  then  mapped 
onto  successive  processors  of  the  arrangement  Q.  The  elements  of  array  B  are  first  divided  into 
blocks  of  5,  which  are  then  mapped  in  a  wrapped  manner  across  the  processors  of  the  arrangement 
Q.  The  two  dimensions  of  array  C  are  individually  mapped  to  the  two  dimensions  of  the  processor 
arrangement R. The rows of C are blocked while the columns are cyclically mapped. The
two-dimensional array D is distributed across the one-dimensional processor arrangement Q such that
the second axis is not distributed. That is, each row of the array is mapped as a single object.
To determine the distribution of the first dimension, the rows are first blocked into groups of 10 and
these groups are then mapped to successive processors of Q. In this case, N must be at least 10 to
accommodate the rows of D. Note that in the case of array B, the compiler chooses the abstract
processor arrangement for the distribution.
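
The owner computations implied by these directives can be modeled outside HPF. The following Python sketch is only an illustration under stated assumptions: the helper names block_owner and cyclic_owner are hypothetical, elements are numbered from 1 and abstract processors from 0, and the default BLOCK size is taken to be the ceiling of n/p.

```python
def block_owner(i, n, p, m=None):
    """Processor (0-based) owning element i (1-based) of an n-element
    dimension distributed BLOCK(m) over p processors.
    When m is omitted, the block size defaults to ceil(n / p)."""
    if m is None:
        m = -(-n // p)  # ceiling division
    if m * p < n:
        raise ValueError("BLOCK(m) needs m * p >= n")
    return (i - 1) // m


def cyclic_owner(i, p, k=1):
    """Processor owning element i under CYCLIC(k): blocks of k elements
    are dealt to the p processors in a wrapped (round-robin) manner."""
    return ((i - 1) // k) % p


# A(100) distributed BLOCK over 4 processors: 25 contiguous elements each.
a_owners = [block_owner(i, 100, 4) for i in range(1, 101)]
# B(200) distributed CYCLIC(5) over 4 processors.
b_owners = [cyclic_owner(i, 4, k=5) for i in range(1, 201)]
# Rows of D(100,200) distributed BLOCK(10): at least 10 processors are needed.
d_row_owners = [block_owner(i, 100, 10, m=10) for i in range(1, 101)]
```

With four processors, elements 1 to 25 of A fall on processor 0, while elements 1 to 5 and 21 to 25 of B both fall on processor 0 because the blocks of five wrap around.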

The  REDISTRIBUTE  directive  is  syntactically  similar  to  the  DISTRIBUTE  directive  but 
may  appear  only  in  the  execution  part  of  a  program  unit.  It  is  used  for  dynamically  changing  the 
distribution of an array and may only be used for arrays that have been declared as DYNAMIC.
The  only  difference  between  DISTRIBUTE  and  REDISTRIBUTE  directives  is  that  the  former 
can  use  only  specification  expressions  while  the  latter  can  use  any  expression  including  values 
computed  at  runtime. 


REAL A(100)
!HPF$ DISTRIBUTE (BLOCK), DYNAMIC :: A
k = ...
!HPF$ REDISTRIBUTE A(CYCLIC(k))

Here, A starts with a block distribution and is dynamically remapped to a cyclic distribution whose
block size is computed at runtime.

When  an  array  is  redistributed,  arrays  that  are  ultimately  aligned  to  it  (see  next  subsection) 
are  also  remapped  to  maintain  the  alignment  relationship. 
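
To give a feel for the data motion that such a remapping implies, the sketch below lists the elements of a 100-element array whose owner changes when a BLOCK distribution is replaced by CYCLIC(k). It is a Python model rather than HPF; the function names, the choice of four processors and the value k = 5 are assumptions made purely for illustration.

```python
def block_owner(i, n, p):
    """Owner (0-based) of element i (1-based) under BLOCK, block size ceil(n/p)."""
    return (i - 1) // (-(-n // p))


def cyclic_owner(i, p, k):
    """Owner (0-based) of element i (1-based) under CYCLIC(k)."""
    return ((i - 1) // k) % p


def moved_elements(n, p, k):
    """Elements whose owning processor differs between BLOCK and CYCLIC(k)."""
    return [i for i in range(1, n + 1)
            if block_owner(i, n, p) != cyclic_owner(i, p, k)]


# k stands for the value computed at run time in the REDISTRIBUTE example.
moved = moved_elements(100, 4, 5)
```

In this configuration 60 of the 100 elements change owner, which is the communication a compiler would have to generate for the remapping.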


Alignment  Directives 

The ALIGN directive is used to indirectly specify the mapping of an array (the alignee) by
specifying its relative position with respect to another object (the align-target) which is ultimately
distributed. HPF provides a variety of alignments including identity alignment, offsets, axis
collapse, axis transposition, and replication, using dummy arguments which range over the entire index
range of the alignee. Only linear expressions are allowed in the specification of the align-target, with
the restriction that an align dummy can appear in only one expression in an ALIGN directive. The
alignment function must be such that the alignee is not allowed to “wrap around” or “extend past the
edges” of the align-target.


!HPF$ ALIGN A(:,:) WITH B(:,:)     ! identity alignment
!HPF$ ALIGN C(I) WITH D(I-5)       ! offset
!HPF$ ALIGN E(I,*) WITH F(I)       ! collapse
!HPF$ ALIGN G(I) WITH H(I,*)       ! replication
!HPF$ ALIGN R(I,J) WITH S(J,I)     ! transposition


If A is aligned to B, which is in turn aligned with C, then A is considered to be immediately aligned
to B but ultimately aligned to C. Note that intermediate alignments are useful only to provide the
“ultimate” alignment, since only the root of the alignment tree can be distributed.
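
The notion of ultimate alignment amounts to composing the alignment functions along the path to the root of the alignment tree. The Python sketch below illustrates this for one-dimensional linear alignments only; the dictionary representation and the function name ultimate_alignment are assumptions introduced here, not part of HPF.

```python
def ultimate_alignment(name, aligns):
    """Follow the chain of alignments from `name` to the root of its tree.

    `aligns` maps an alignee to (target, stride, offset), meaning that
    element i of the alignee is aligned with element stride*i + offset
    of the target. Returns (root, stride, offset) of the composed map."""
    stride, offset = 1, 0
    seen = {name}
    while name in aligns:
        target, s, o = aligns[name]
        # Compose: i -> s * (stride*i + offset) + o
        stride, offset = s * stride, s * offset + o
        name = target
        if name in seen:
            raise ValueError("alignment chain contains a cycle")
        seen.add(name)
    return name, stride, offset


# A(I) WITH B(I-5) and B(J) WITH C(2*J+1): A is immediately aligned to B
# and ultimately aligned to C, with A(I) placed at C(2*I-9).
aligns = {"A": ("B", 1, -5), "B": ("C", 2, 1)}
root_of_a = ultimate_alignment("A", aligns)
root_of_c = ultimate_alignment("C", aligns)
```

Only the root, C in this example, would carry a DISTRIBUTE directive; the mapping of A and B follows from the composed functions.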

The  REALIGN  directive  is  syntactically  similar  to  the  ALIGN  directive  but  may  appear  only 
in  the  execution  part  of  a  program  unit.  It  is  used  for  dynamically  changing  the  alignment  of  an 
array and may only be used for arrays that have been declared as DYNAMIC. As in the case
of REDISTRIBUTE, the REALIGN directive can use computed values in its expression. Note
that only an object which is not the root of an alignment tree can be explicitly realigned and that
such a realignment does not affect the mapping of any other array.


Template  Directive 

In  certain  codes,  we  may  want  to  align  arrays  to  an  index  space  which  is  larger  than  any  of  the 
data arrays declared in the program. HPF introduces the concept of a template as an abstract index
space. Declaration of templates uses the keyword TEMPLATE and a syntax similar to that of
regular data arrays. The distinction is that templates do not take any storage.

Consider the situation where two arrays of size N x (N + 1) and (N + 1) x N have to be aligned
such that the bottom right corner elements are mapped to the same processor. This can be done as
follows:

!HPF$ TEMPLATE T(N+1,N+1)
REAL A(N,N+1), B(N+1,N)
!HPF$ ALIGN A(I,J) WITH T(I+1,J)
!HPF$ ALIGN B(I,J) WITH T(I,J+1)
!HPF$ DISTRIBUTE T(BLOCK, BLOCK)

As seen above, templates can be used as align-targets and may be distributed using a DISTRIBUTE
(or REDISTRIBUTE) directive, but may not be alignees.
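
The index arithmetic of this example is easy to check. In the Python sketch below, the functions align_a and align_b are hypothetical stand-ins for the two ALIGN directives, and N = 4 is an arbitrary choice.

```python
def align_a(i, j):
    """ALIGN A(I,J) WITH T(I+1,J)."""
    return (i + 1, j)


def align_b(i, j):
    """ALIGN B(I,J) WITH T(I,J+1)."""
    return (i, j + 1)


def inside_template(cell, n):
    """True if the cell lies within the index space of T(N+1,N+1)."""
    return 1 <= cell[0] <= n + 1 and 1 <= cell[1] <= n + 1


n = 4
corner_a = align_a(n, n + 1)   # bottom right element of A(N,N+1)
corner_b = align_b(n + 1, n)   # bottom right element of B(N+1,N)
a_cells = [align_a(i, j) for i in range(1, n + 1) for j in range(1, n + 2)]
b_cells = [align_b(i, j) for i in range(1, n + 2) for j in range(1, n + 1)]
```

Both corner elements land on template cell T(N+1,N+1), and no element of either array extends past the edges of the template.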

Procedure  Boundaries 

HPF allows distributed arrays to be passed as actual arguments to procedures. As noted before,
HPF places restrictions on sequence association; therefore the rank and shape of the actual
arguments must match those of the corresponding dummy arguments. HPF provides a wide
variety of options to specify the distribution of the dummy argument. The user can specify that
the distribution of the actual argument be inherited by the dummy argument. In other cases, the
user can provide a specific mapping for the dummy, and the actual argument may need to be remapped to
satisfy this mapping. If the actual is remapped on entry, then the original mapping is restored on
exit from the procedure. The user can also demand that the actual argument be already mapped
as specified for the dummy argument. In this case, it is incumbent upon the caller to explicitly
remap before the call to the procedure. In the presence of interface blocks such a remap may be
implicitly provided by the compiler.

HPF also provides an INHERIT directive which specifies that the template of the actual
argument be copied and used as the template for the dummy argument. This makes a difference when
only a subsection of an array is passed as an actual argument. Without the INHERIT directive,
the template of the dummy argument is implicitly assumed to be the same shape as the dummy
and the dummy is aligned to the template using the identity mapping.

4.3  Data  Parallel  Constructs 

Fortran 90 has syntax to express data parallel operations on full arrays. For example, the statement
A = B + C indicates that the two arrays B and C should be added element by element (in
any order) to produce the array A. The two main reasons for introducing these features are the
conciseness of the expressions (note the absence of explicit loops) and the possibility of exploiting
the undefined order of elemental operations for vector and parallel machines. HPF extends Fortran
90 with several new features to explicitly specify data parallelism. The FORALL statement and
construct generalize the Fortran 90 array operations to allow not only more complicated array
sections but also the calling of pure procedures on the elements of arrays. The INDEPENDENT
directive can be used to specify parallel iterations.

Forall  Statement 

The FORALL statement extends the Fortran 90 array operations by making the index used to
range over the elements explicit. Thus, this statement can be used to make an array assignment to
array elements or sections of arrays, possibly masked with a scalar logical expression. The general
form of the FORALL statement is as follows:

FORALL (triplet, ... [, scalar-mask])
assignment

where a triplet has the form:
subscript = lower : upper [: stride]

Here, the FORALL header may have multiple triplets and assignment is an arithmetic or pointer
assignment. First the lower bound, upper bound and the optional stride of each triplet are evaluated
(in any order). The cartesian product of the result provides the valid set of subscript values over
which the mask is then evaluated. This gives rise to the active combinations. The right hand
side of the assignment is then evaluated for all the active combinations before any assignment to
the corresponding elements on the left hand side.

FORALL (I=1:N, J=2:N)
A(I,J) = A(I,J-1)*B(I)

In the above example, the new values of the array A are determined by the old values of A in
the column immediately to the left and the array B.
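
The difference between these semantics and those of an ordinary sequential loop nest can be seen in a small Python model. This is a sketch, not HPF: the function names are hypothetical, indices are 0-based, and the 3 x 3 data is arbitrary.

```python
def forall_update(a, b):
    """Model of FORALL (I=1:N, J=2:N) A(I,J) = A(I,J-1)*B(I).
    Every right hand side is evaluated with the old values of A
    before any element of A is assigned."""
    n = len(a)
    rhs = {(i, j): a[i][j - 1] * b[i] for i in range(n) for j in range(1, n)}
    out = [row[:] for row in a]
    for (i, j), value in rhs.items():
        out[i][j] = value
    return out


def do_loop_update(a, b):
    """The same assignment in sequential DO loops: later iterations
    read values already overwritten by earlier ones."""
    out = [row[:] for row in a]
    n = len(out)
    for i in range(n):
        for j in range(1, n):
            out[i][j] = out[i][j - 1] * b[i]
    return out


a = [[1.0, 5.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [2.0, 1.0, 0.5]
forall_result = forall_update(a, b)
do_result = do_loop_update(a, b)
```

The two results agree in the second column but differ in the third, where the sequential loop picks up the freshly computed second column instead of the old one.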

Forall  Construct 

The FORALL construct is a generalization of the FORALL statement allowing multiple
statements to be associated with the same forall header. The only kinds of statements allowed are
assignment, the WHERE statement and another FORALL statement or construct.

FORALL (triplet, ... [, scalar-mask])
statement

END FORALL

Here, the header is evaluated as before and the execution of one statement is completed for all
active combinations before proceeding to the next statement. Thus, conceptually in a FORALL
construct, there is a synchronization before the assignment to the left hand side and between any
two statements. Obviously, some of these synchronizations may not be needed and can be optimized
away.

Pure  procedures 

HPF has introduced a new attribute for procedures called PURE which allows users to declare
that the given procedure has no side effects. That is, the only effects of the procedure are
either the value returned by the function or possible changes in the values of INTENT(OUT) or
INTENT(INOUT) arguments. HPF defines a set of syntactic constraints that must be followed
in order for a procedure to be pure. This allows the compiler to easily check the validity of the
declaration. Note that a procedure can only call other pure procedures to remain pure.

Only  pure  functions  can  be  called  from  a  FORALL  statement  or  construct.  Since  pure  functions 
have  no  side-effects  other  than  the  value  returned,  the  function  can  be  called  for  the  active  set  of 
index  combinations  in  any  order. 

Independent  Directive 

The INDEPENDENT directive can be used with a DO loop or a FORALL statement or
construct to indicate that there are no cross-iteration data dependences. Thus, for a DO loop the
directive asserts that the iterations of the loop can be executed in any order without changing the
final result. Similarly, when used with a FORALL construct or statement, the directive asserts
that there is no synchronization required between the executions of the different values of the active
combination set.

With a DO loop, the INDEPENDENT directive can be augmented with a list of variables
which can be treated as private variables for the purposes of the iterations.

!HPF$ INDEPENDENT, NEW(X)
DO I = 1,N
  X = B(I)
  A(f(I)) = X
END DO

Here, the INDEPENDENT directive is asserting that the function f(I) returns a permutation
of the index set, i.e., no two iterations are going to assign to the same element of A. Similarly,
the NEW clause asserts that the loop carried dependence due to the variable X is spurious and the
compiler can execute the loops by (conceptually) allocating a new X variable for each iteration.
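
What the directive asserts can be checked experimentally in a Python model of the loop. The sketch below is an illustration only: is_permutation, run_loop and the two index functions are hypothetical names, and the local variable x plays the role of the private copy of X introduced by the NEW clause.

```python
import random


def is_permutation(f, n):
    """True if f maps the index set 1..n onto itself without collisions."""
    return sorted(f(i) for i in range(1, n + 1)) == list(range(1, n + 1))


def run_loop(f, b, order):
    """Execute the loop body X = B(I); A(f(I)) = X for I taken in `order`."""
    a = [None] * len(b)
    for i in order:
        x = b[i - 1]          # x is private to this iteration
        a[f(i) - 1] = x
    return a


n = 8
b = [10 * i for i in range(1, n + 1)]


def reverse(i):
    """A permutation of 1..n: the assertion made by INDEPENDENT holds."""
    return n + 1 - i


def collide(i):
    """Not a permutation: iterations 1 and 2 both write element 1."""
    return (i + 1) // 2


order = list(range(1, n + 1))
random.Random(0).shuffle(order)
forward = run_loop(reverse, b, range(1, n + 1))
shuffled = run_loop(reverse, b, order)
collide_up = run_loop(collide, b, range(1, n + 1))
collide_down = run_loop(collide, b, range(n, 0, -1))
```

When f is a permutation the result does not depend on the iteration order. With the colliding function, the final value of the first element depends on which iteration runs last, so the directive would be an incorrect assertion.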

4.4  Examples  of  HPF  Codes 

In  this  section  we  provide  two  code  fragments  using  some  of  the  HPF  features  described  above. 
The  first  is  the  Jacobi  iterative  algorithm  and  the  second  is  the  Modified  Gram-Schmidt  algorithm 
discussed  earlier. 

The  HPF  version  of  the  Jacobi  iterative  procedure  which  may  be  used  to  approximate  the 
solution  of  a  partial  differential  equation  discretized  on  a  grid,  is  given  below. 

!HPF$ processors p(number_of_processors())
real u(1:n,1:n), f(1:n,1:n)
!HPF$ align f with u
!HPF$ distribute u (*, block)
forall (i=2:n-1, j=2:n-1)
  u(i,j) = 0.25 * (f(i,j) + u(i-1,j) + u(i+1,j) + &
                   u(i,j-1) + u(i,j+1))
end forall

At each step, it updates the current approximation at a grid point, represented by the array u,
by computing a weighted average of the values at the neighboring grid points and the value of the
right hand side function represented by the array f.

The array f is aligned with the array u using the identity alignment. The columns of u (and
thus those of f indirectly) are then distributed across the processors executing the program. The
computation is expressed using a FORALL statement, where all the right hand sides are evaluated
using the old values of u before assignment to the left hand side.

To  reiterate,  the  computation  is  specified  using  a  global  index  space  and  does  not  contain  any 
explicit  data  motion  constructs.  Given  that  the  underlying  arrays  are  distributed  by  columns, 
the  edge  columns  will  have  to  be  communicated  to  neighboring  processors.  It  is  the  compiler’s 
responsibility  to  analyze  the  code  and  generate  parallel  code  with  appropriate  communication 
statements  inserted  to  satisfy  the  data  requirements. 
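
The communication that the compiler has to derive here can be made concrete with a small model. The Python sketch below computes, for a (*, BLOCK) distribution, which non-local columns each processor reads during one Jacobi update. The function names and the sizes n = 16 and p = 4 are assumptions chosen for illustration; an actual compiler may organize the exchange differently.

```python
def block_columns(n, p):
    """Column ranges (1-based, inclusive) owned by each processor when the
    second dimension of an n x n array is distributed BLOCK over p processors."""
    m = -(-n // p)  # block size, ceil(n / p)
    return [(q * m + 1, min((q + 1) * m, n)) for q in range(p) if q * m + 1 <= n]


def halo_columns(n, p):
    """For each processor, the columns it reads but does not own when
    updating the interior points u(i,j), 2 <= j <= n-1, of its block."""
    halos = []
    for lo, hi in block_columns(n, p):
        needed = set()
        for j in range(max(lo, 2), min(hi, n - 1) + 1):
            needed.update((j - 1, j + 1))   # u(i,j-1) and u(i,j+1)
        halos.append(sorted(c for c in needed if c < lo or c > hi))
    return halos


blocks = block_columns(16, 4)
halos = halo_columns(16, 4)
```

Each interior processor needs exactly one column from each of its two neighbors, and the two end processors need one column each, which matches the edge column exchange described above.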

The HPF version of the Modified Gram-Schmidt algorithm is given below.

real v(n,n)
!HPF$ distribute v (*, block)
do i = 1,n
  tmp = 0.0
  do k = 1,n
    tmp = tmp + v(k,i)*v(k,i)
  enddo
  xnorm = 1.0 / sqrt(tmp)
  do k = 1,n
    v(k,i) = v(k,i) * xnorm
  enddo
!HPF$ independent, new (tmp)
  do j = i+1,n
    tmp = 0.0
    do k = 1,n
      tmp = tmp + v(k,i)*v(k,j)
    enddo
    do k = 1,n
      v(k,j) = v(k,j) - tmp*v(k,i)
    enddo
  enddo
enddo

The first directive declares that the columns of the array v are to be distributed by block across
the memories of the underlying processor set. The outer loop is sequential and is thus executed by
all processors. Given the column distribution, in the ith iteration of the outer loop, the first two k
loops would be executed by the processor owning the ith column.

The second directive declares the j loop to be independent and tmp to be a new variable. Thus
the iterations of the j loop can be executed in parallel, i.e., each processor updates the columns
that it owns in parallel. Since the ith column is used for this update, it will have to be broadcast
to all processors.

Footnote: A Fortran 90 version of the code fragment, not shown here, would have used array constructs for the k loops.
This would make the parallelism in the inner loops explicit.

The  distribution  of  the  columns  by  contiguous  blocks  implies  that  processors  will  become  idle 
as  the  computation  progresses.  A  cyclic  distribution  of  the  columns  would  eliminate  this  problem. 
This  can  be  achieved  by  replacing  the  distribution  directive  with  the  following: 

!HPF$  distribute  v  (*,  cyclic) 

This declares the columns to be distributed cyclically across the processors, and thus forces the inner
j loop to be strip-mined in a cyclic rather than in a block fashion. Thus, all processors are busy
until the tail end of the computation.
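
The load balance argument can be quantified with a small model. The Python sketch below counts, for each iteration i of the outer loop, how many processors own at least one of the columns i+1 to n updated by the j loop. The function names and the sizes n = 16 and p = 4 are assumptions chosen for illustration.

```python
def owner_block(j, n, p):
    """Owner (0-based) of column j (1-based) under a BLOCK distribution."""
    return (j - 1) // (-(-n // p))


def owner_cyclic(j, n, p):
    """Owner (0-based) of column j (1-based) under a CYCLIC distribution."""
    return (j - 1) % p


def busy_processors(i, n, p, owner):
    """Number of processors owning at least one column in i+1..n."""
    return len({owner(j, n, p) for j in range(i + 1, n + 1)})


n, p = 16, 4
block_busy = [busy_processors(i, n, p, owner_block) for i in range(1, n + 1)]
cyclic_busy = [busy_processors(i, n, p, owner_cyclic) for i in range(1, n + 1)]
```

Under the block distribution one processor drops out after every four iterations, whereas under the cyclic distribution all four remain busy until fewer than four columns are left.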

The above distributions only exploit parallelism in one dimension, whereas the inner k loops
can also run in parallel. This can be achieved by distributing both the dimensions of v as follows:

!HPF$  distribute  v  (block,  cyclic) 

Here, the processors are presumed to be arranged in a two-dimensional mesh and the array is
distributed such that the elements of a column of the array are distributed by block across a
column of processors whereas the columns as a whole are distributed cyclically. Thus, the first k
loop becomes a parallel reduction of the ith column across the set of processors owning the ith
column. Similarly, the second k loop can be turned into a FORALL statement which is executed
in parallel by the column of processors which owns the ith column. The second set of k loops,
inside the j loop, can be similarly parallelized.
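
The following Python sketch models this two-dimensional mapping and the resulting reduction. It is an illustration under assumptions: the helper names are hypothetical, the mesh is taken to be pr x pc with 0-based coordinates, and the final summation stands in for whatever reduction the compiler would actually generate.

```python
def owner_2d(k, j, n, pr, pc):
    """Mesh coordinates of the processor owning v(k,j) under (BLOCK, CYCLIC)
    on a pr x pc mesh; k and j are 1-based."""
    m = -(-n // pr)                      # rows per block, ceil(n / pr)
    return ((k - 1) // m, (j - 1) % pc)


def column_owners(j, n, pr, pc):
    """The set of processors that together hold column j."""
    return sorted({owner_2d(k, j, n, pr, pc) for k in range(1, n + 1)})


def partial_sums(column, pr):
    """Each processor in the mesh column sums the squares of its own block."""
    m = -(-len(column) // pr)
    return [sum(x * x for x in column[q * m:(q + 1) * m]) for q in range(pr)]


owners = column_owners(3, 8, 4, 2)
parts = partial_sums([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4)
total = sum(parts)   # combining the partial results completes the reduction
```

Column 3 of an 8 x 8 array is held by the four processors of mesh column 0, each of which contributes one partial sum to the norm computation.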

Overall, it is clear that using the approach advocated by HPF allows the user to focus on the
performance critical issues at a very high level. Thus, it is easy for the user to experiment with a
different distribution, by just changing the distribute directives. The new code is then recompiled
before running on the target machine. In contrast, the effort required to change the program if it
were written using low-level communication primitives would be much greater.

5  Object  Parallelism  with  pC++

pC++ is an experimental extension to C++ designed to allow programmers to build distributed
data structures with parallel execution semantics. These data structures are organized as “concurrent
aggregate” collection classes which can be aligned and distributed over the memory hierarchy
of a parallel machine in a manner modeled on the High Performance Fortran Forum (HPF)
directives for Fortran 90. The first version of the compiler is a preprocessor which generates Single
Program Multiple Data (SPMD) C++ code which runs on the Thinking Machines CM-5, the Intel
Paragon, the BBN TC2000 and the Sequent series of machines. As HPF becomes available on these
systems, future versions of the compiler will allow object level linking between pC++ distributed
collections and HPF distributed arrays.

The basic concept of pC++ is the notion of a distributed collection, which is a type of concurrent
aggregate “container class” [15, 35]. More specifically, a collection is a structured set of objects
distributed across the processing elements of the computer. A runtime system uses the memory
hierarchy and processor interconnect topology of the target machine to guide the distribution of
collection elements. A collection can be an Array, a Grid, a Tree, or any other partitionable data
structure.

Collections  have  the  following  components: 

•  A collection class describing the basic topology of the set.

•  A  size  or  shape  for  each  instance  of  the  collection  class.  For  example,  the  dimensions  of  an 
array  or  the  height  of  a  tree. 

•  A base type for collection elements. This can be any C++ type or class. For example, one
can  define  an  Array  of  Floats,  or  a  Grid  of  Finite  Elements,  or  Matrix  of  Complex,  or  a  Tree 
of  Xs,  where  X  is  the  class  of  each  node  in  the  tree. 

•  A  Distribution  object.  The  distribution  describes  an  abstract  coordinate  system  that  will  be 
distributed  over  the  available  memory  modules  of  the  target  by  the  run-time  system. 

•  A function object called the Alignment. This function maps collection elements to the abstract
coordinate  system  of  the  Distribution  object. 

The pC++ language has a library of standard collection classes that may be used (or subclassed)
by the programmer [36, 49, 17, 20]. This includes collection classes such as DistributedArray,
DistributedMatrix, DistributedVector, and DistributedGrid. To illustrate the points above, consider
the problem of creating a distributed 5 by 5 matrix of floating point numbers. We begin by building
a Distribution. A distribution is defined by its number of dimensions, the size in each dimension and
how the elements are mapped to the processors. In HPF [28] this mapping is called a distribution.
Current distributions include BLOCK, CYCLIC and WHOLE, but more general forms will be
added later. Let us assume that the distribution is distributed over the processors' memories by
mapping Whole rows of the distribution to individual processors using a Cyclic pattern where the
ith row is mapped to processor memory i mod P, on a P processor machine.

pC++ uses a special implementation dependent library class called Processors. In the current
implementation, it represents the set of all processors available to the program at run time. To
build a distribution of some size, say 7 by 7, one would write

Processors P;
Distribution myDist(7, 7, &P, Cyclic, Whole);

Next,  we  create  an  alignment  object  called  myAlign  that  defines  a  domain  and  function  for 
mapping  the  matrix  to  the  distribution.  The  matrix  A  can  be  defined  using  the  library  collection 
class  DistributedMatrix  with  a  base  type  of  Float. 

Align myAlign(5, 5, "[ALIGN(domain[i][j], myDist[i][j])]");
DistributedMatrix<Float> A(myDist, myAlign);

The collection constructor uses the alignment object, myAlign, to define the size and dimension
of the collection. The mapping function is described by a text string corresponding to the HPF
alignment directive. It defines a mapping from a domain structure to a distribution structure using
dummy index variables.

Figure 3: Alignment and Distribution

The intent of this two stage mapping, as it was originally designed for HPF, is to allow the
distribution  to  be  a  frame  of  reference  so  that  different  arrays  could  be  aligned  with  each  other  in  a 
manner  that  promotes  memory  locality.  For  example,  suppose  we  wish  to  perform  a  matrix  vector 
multiply.  Since  the  DistributedMatrix  and  DistributedVector  library  classes  provide  many  common 
functions  through  C++  function  overloading,  a  matrix  vector  multiply  is  simply  written  as 

Y = A*X;

where  X  and  Y  are  distributed  arrays.  While  the  semantic  meaning  and  computed  result  of  the 
expression  is  independent  of  alignment  and  distribution,  performance  is  best  if  the  alignment  of 
the  operands  matches  the  library  function  for  matrix  vector  multiply.  In  this  case,  the  algorithm 
broadcasts  the  vector  operand  along  the  columns  of  the  array  and  then  performs  a  reduction  along 
rows.  Aligning  X  along  with  the  first  row  of  the  matrix  A,  and  Y  with  the  first  column  yields  the 
best  performance.  The  vectors  are  declared  by 

Align XAlign(5, "[ALIGN(X[i], myDist[0][i])]");
Align YAlign(5, "[ALIGN(Y[i], myDist[i][0])]");
DistributedVector<Float> X(myDist, XAlign);
DistributedVector<Float> Y(myDist, YAlign);

The  two  stage  mapping  process  for  this  example  is  illustrated  in  Figure  4. 
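
The broadcast and reduction pattern described above can be traced in a sequential Python model. This is a sketch of the data flow only, not of the pC++ library routine: the function name is hypothetical, and the grid of cells stands for the abstract coordinate system of the distribution.

```python
def matvec_broadcast_reduce(a, x):
    """Compute y = a * x by the pattern described in the text."""
    rows, cols = len(a), len(x)
    # Step 1: x[j] starts in cell (0, j), aligned with the first row,
    # and is broadcast down column j so that every cell holds a copy.
    x_at = [[x[j] for j in range(cols)] for _ in range(rows)]
    # Step 2: every cell forms its local product.
    local = [[a[i][j] * x_at[i][j] for j in range(cols)] for i in range(rows)]
    # Step 3: a reduction along row i leaves y[i] in cell (i, 0),
    # which is where the alignment of Y with the first column places it.
    return [sum(local[i]) for i in range(rows)]


y = matvec_broadcast_reduce([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0])
```

Because the operands already sit where the broadcast starts and where the reduction ends, no additional data movement is needed before or after the multiplication.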

5.1  Collection  Functions  and  Parallelism 

There are two forms of concurrency in pC++. One is based on the concurrent application of a
method function, associated with the element class, across the entire collection; the other
is associated with special functions that are invoked as a set of parallel threads, one running on
each processor. More precisely, a collection is a set of element objects. A local collection is the
subset of elements mapped to one processor by the alignment and distribution functions. Each
local collection is realized as a Processor Object and there is an associated thread of computation
that executes all method functions that modify or access the local elements.

The memory model used by pC++ is not shared. As with HPF, there is a single main
thread of computation and parallel operations are invoked from that thread. Collection elements
are distributed over the processor objects, each of which has a private address space. Global data,
which can be accessed and modified by the main thread, is visible to the processor objects, but a
processor object cannot modify Global data. Each processor object can read and write its local
collection of elements, but the only way a processor object or the main thread of execution can
access remote collection elements is through special kernel functions which provide a copy of
remote collection elements.

A collection class C is a data type that is parameterized by the class of the element,
C<ElementType>. Collections have two types of methods: the standard public, private and
protected methods of any normal class; and a set of fields and methods that are added to the element
class to provide access to the collection structure. The members of this additional family are
called MethodOfElement fields.

Syntactically,  a  collection  class  takes  the  form: 

collection CollectionName: ParentCollection {
public:
private:
protected:
  // Field variables declared here are local to each
  // processor object.
  // Methods declared here are executed in parallel by
  // the associated processor object thread.
MethodOfElement:
  // Field variables declared here are added to each element.
  // Methods declared here are added to the element class.
  // These methods are the "data parallel" functions.
}

Data  fields  defined  in  the  public,  private  and  protected  areas  are  duplicated  in  each  processor 
object.  Methods  in  these  areas  are  executed  by  the  threads  of  the  processor  objects. 

5.2  An  Example:  The  Gram-Schmidt  Algorithm 

To illustrate these ideas we will consider the same Gram-Schmidt algorithm discussed earlier. pC++
programmers work by building collection classes derived from the base library. Because Gram-
Schmidt works on column vectors of a matrix, we will cast our matrix as a distributed collection of
column vectors. Consequently, we shall assume we have a library of double precision vectors which
have all the standard vector-vector and vector-scalar operators.

class Vector {
public:
  Vector(int n);                   // a constructor
  Vector & operator *=(double);    // V = V * 3.14
  double dotProduct(Vector *);     // the dot product
  Vector & operator -=(Vector);    // V = V - W
  Vector operator *(double);       // mult. expression
};

We will define a collection MyMatrix which will be a distributed array of elements of class
Vector. The matrix object and the Gram-Schmidt operation will be invoked as

main(){
  Processors P;
  int n = 100;
  Distribution myDist(n, &P, Cyclic);
  Align myAlign(n, "[ALIGN(domain[i], myDist[i])]");
  MyMatrix<Vector> M(myDist, myAlign, n);
  M.gramSchmidt(n);
}

This declares M to be a MyMatrix collection of size n of elements of class Vector. The extra
parameter n on the declaration of M is passed to the element constructor so that each vector
element has size n. The function gramSchmidt() will be a processor object parallel function of the
collection which is defined as

collection MyMatrix: DistributedArray {
public:
  void gramSchmidt(int n);
MethodOfElement:
  void update(ElementType *);
  virtual ElementType & operator *=(double);
  virtual double dotProduct(ElementType *);
  virtual ElementType & operator -=(ElementType);
  virtual ElementType operator *(double);
};

The  element  level,  data  parallel  functions  in  this  collection  include  a  method  update  which  will  be 
described  below,  and  four  virtual  functions  which  are  provided  by  the  element  class  which,  in  our 
case,  is  Vector.  Because  the  collection  is  defined  separately  from  the  element,  if  we  wish  to  assume 
the  element  has  special  properties,  these  are  listed  as  virtual  functions.  In  the  case  of  the  Gram- 
Schmidt  algorithm  we  need  to  be  able  to  compute  the  dot  product  of  vectors,  multiply  vectors  by 
a  scalar  and  subtract  a  multiple  of  one  vector  from  another. 
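
To make the three required element operations concrete, the following is a minimal sequential sketch of an element class with the same interface as Vector. It is plain C++, not pC++; the storage (a std::vector<double>) and the accessor at() are assumptions introduced here for illustration and are not taken from the pC++ library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the element class Vector used in the text.
// The data_ member and the at() accessor are assumptions of this sketch.
class Vector {
public:
    explicit Vector(int n) : data_(static_cast<std::size_t>(n), 0.0) {}

    // Element access, added only so the sketch can be exercised.
    double &at(int i) { return data_[static_cast<std::size_t>(i)]; }

    // V = V * s: scale every component in place.
    Vector &operator*=(double s) {
        for (double &x : data_) x *= s;
        return *this;
    }

    // Dot product with another vector of the same length.
    double dotProduct(Vector *w) {
        assert(w != nullptr && w->data_.size() == data_.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < data_.size(); ++i) sum += data_[i] * w->data_[i];
        return sum;
    }

    // V = V - W: componentwise subtraction in place.
    Vector &operator-=(Vector w) {
        assert(w.data_.size() == data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] -= w.data_[i];
        return *this;
    }

    // Scaled copy, used to form a multiple of a vector.
    Vector operator*(double s) {
        Vector r(*this);
        r *= s;
        return r;
    }

private:
    std::vector<double> data_;
};
```

Subtracting a multiple of one vector from another is then written as w -= v * c, which combines the last two operators.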

The  Gram-Schmidt  function  is  nearly  a  direct  translation  of  the  program  in  section  3.0. 
void MyMatrix::gramSchmidt(int n){
    ElementType *v;
    int i;
    double tmp;
    for(i = 0; i < n; i++){
        v = this->Get_Element(i);
        tmp = v->dotProduct(v);
        *v *= 1.0 / sqrt(tmp);
        (*this)[i+1 : n-1].update(v);
    }
}

In this program gramSchmidt(n) is a collection public function, which means that it is invoked
on each processor object. The main loop first extracts the column vector element. The pointer
v obtained by the kernel function Get_Element(i) references a copy of the i-th element if it is not
part of the local collection of the invoking processor object. Otherwise, v references the actual
element. Notice that each processor thread then duplicates the work of computing the dot product
and normalizing its copy of v.

The  element  function  update(v)  is  invoked  in  “data  parallel”  mode  on  each  element  in  the  local 
collection  that  has  indexes  in  the  given  subrange.  In  pC++  this  is  accomplished  with  an  expression 
of  the  form 

collection.elementMethod()

which invokes the element method function “in parallel” on each of the elements of the collection.
To invoke the method on a subrange we use a Fortran 90 style triplet

collection[lower : upper : stride].elementMethod()
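
The index set selected by such a triplet can be sketched in ordinary C++ as follows. The helper applyOnTriplet is hypothetical and is not part of pC++; it visits the selected elements one after another, whereas pC++ applies the method to them in parallel. The sketch assumes the Fortran 90 convention that the upper bound is inclusive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a pC++ primitive): apply `method` to the elements
// with indexes lower, lower + stride, lower + 2*stride, ... up to and
// including upper, clipped to the size of the collection.
template <typename Element, typename Method>
void applyOnTriplet(std::vector<Element> &collection,
                    int lower, int upper, int stride, Method method) {
    assert(lower >= 0 && stride > 0);
    const int size = static_cast<int>(collection.size());
    for (int i = lower; i <= upper && i < size; i += stride) {
        method(collection[static_cast<std::size_t>(i)]);
    }
}
```

With a stride of 1, the call applyOnTriplet(c, i + 1, n - 1, 1, f) touches the same elements as the two-part form [i+1 : n-1] used in the Gram-Schmidt loop.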

The parallel operation update is identical to the “DoShared” loop in the Fortran-S program.

void MyMatrix::update(ElementType *v){
    double tmp;
    tmp = this->dotProduct(v);
    *this -= (*v) * tmp;
}
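
For readers who want to check the arithmetic, the two functions above can be mirrored by a sequential sketch in standard C++. This is not pC++ and involves no distribution: columns are stored as std::vector<double>, and the names Column and dot are assumptions of the sketch. The loop over j stands in for the data-parallel call on the subrange [i+1 : n-1].

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;  // assumed representation of one column

double dot(const Column &a, const Column &b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) sum += a[k] * b[k];
    return sum;
}

// Sequential counterpart of the element method update(v): w = w - v * (w . v).
void update(Column &w, const Column &v) {
    const double tmp = dot(w, v);
    for (std::size_t k = 0; k < w.size(); ++k) w[k] -= v[k] * tmp;
}

// Sequential counterpart of gramSchmidt: orthonormalize the columns in place.
// Assumes the columns are linearly independent.
void gramSchmidt(std::vector<Column> &m) {
    const std::size_t n = m.size();
    for (std::size_t i = 0; i < n; ++i) {
        Column &v = m[i];
        const double tmp = dot(v, v);
        assert(tmp > 0.0);
        const double scale = 1.0 / std::sqrt(tmp);
        for (double &x : v) x *= scale;                          // normalize column i
        for (std::size_t j = i + 1; j < n; ++j) update(m[j], v); // subrange [i+1 : n-1]
    }
}
```

After the call, every column has unit length and distinct columns are orthogonal up to rounding error.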

There are two further observations that should be made about this program. First, the use
of Get_Element() by each processor object can create a serial section. Each processor object
other than the owner of the element will request a copy. A more efficient program would
use a coordinated element broadcast, Element_Broadcast(), to make sure each processor object
gets a copy in the smallest amount of time. Second, and more important, is the choice of
data distribution. In our case we have selected a cyclic distribution so that as i increases in the
expression [i+1 : n-1].update(v), a majority of processors can participate for as long as possible.
A block distribution would decrease the parallelism much faster.
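
The effect of the distribution on the available parallelism can be illustrated by counting how many processors own at least one element of the subrange [lower : n-1]. The sketch below is plain C++ with hypothetical helper names; it assumes that a cyclic distribution assigns index i to processor i mod p and that a block distribution assigns contiguous chunks of ceil(n/p) indexes to each processor.

```cpp
#include <cassert>
#include <set>

// Assumed owner functions for n elements distributed over p processors.
int cyclicOwner(int i, int p) { return i % p; }

int blockOwner(int i, int n, int p) {
    const int chunk = (n + p - 1) / p;  // ceil(n / p) elements per processor
    return i / chunk;
}

// Number of distinct processors owning at least one index in [lower, n-1].
int activeProcessors(int lower, int n, int p, bool cyclic) {
    std::set<int> owners;
    for (int i = lower; i < n; ++i) {
        owners.insert(cyclic ? cyclicOwner(i, p) : blockOwner(i, n, p));
    }
    return static_cast<int>(owners.size());
}
```

For n = 100 and p = 4, the subrange starting at index 76 still involves all four processors under the cyclic distribution but only one under the block distribution.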
6  Conclusion


In  this  paper  we  have  examined  three  different  approaches  to  programming  scientific,  data-parallel 
applications. 

Fortran-S plus SVM provides the user with a familiar model: Fortran 77 plus annotations
to distribute loops over processors. Initial experiments with the KOAN SVM system look very
promising,  but  we  need  much  more  experience  with  large  applications  on  large  new  systems  before 
we  can  declare  success.  In  the  future  we  expect  that  more  shared  virtual  memory  systems  will  be 
implemented  on  a  variety  of  massively  parallel  systems.  While  the  details  of  each  system  will  vary, 
the  Fortran-S  project  demonstrates  that  the  compiler  technology  exists  to  make  this  model  work. 

High Performance Fortran provides a high-level approach to data parallel programming for a
wide variety of architectures. Initial experience has shown that the directives as currently provided
by  HPF  are  adequate  for  simple  scientific  codes.  However,  it  is  also  clear  that  HPF  does  not  have 
enough  expressive  power  to  specify  the  distributions  required  for  other  types  of  codes  such  as  multi- 
block  and  unstructured  computations,  adaptive  computations  and  multi-disciplinary  applications 
which  require  integrating  different  types  of  parallel  programming  paradigms. 

Currently there are no existing compilers for HPF; several vendors have promised initial
implementations in the near future. However, several research projects have built prototype
compilers for HPF-like languages. These include the Kali compiler [33], the SUPERB project [67] on
which the Vienna Fortran compiler is based, the Fortran D compiler [29] and several other
efforts [8, 24, 27, 32, 38, 48, 58, 59, 60] that have contributed to the overall goal of compiling global
name space programs for distributed memory SIMD and MIMD machines.

pC++ is just one example of a number of efforts to add parallelism to C++. While pC++
has been ported to a wide variety of machines including the TMC CM-5, Intel Paragon, BBN
TC2000 and the KSR-1, it does have serious drawbacks. First, it relies on an extension to the
C++ language. While not a large departure from C++, the collection plus processor object model
requires considerable sophistication on the part of the user to use correctly. Also, the common
alternative, building class libraries that operate in SPMD parallel execution, is very popular and
does not require extensions to the language. In the future, the success of object parallel extensions
to C++ will depend on providing more functionality than Fortran-S or HPF. The features that will
be important are heterogeneous (polymorphic), dynamic collections and nested data parallelism.

We  have  not  attempted  a  complete  survey  of  the  parallel  programming  landscape.  The  three 
parallel  programming  language  extensions  described  here  represent  only  a  small  fraction  of  the 
approaches  currently  being  investigated.  It  is  clear  that  this  is  an  area  that  will  continue  to 
undergo  rapid  evolution.  Different  application  areas  may  require  different  programming  paradigms 
and  some  multi-disciplinary  problems  will  need  a  combination  of  programming  styles. 